Postulates of a Personalistic Theory of Decision


The seven postulates (P1 through P7) scattered through the first 
five chapters of this book are reproduced here for ready reference along 
with a minimum of explanatory material. The language of the postu- 
lates is here changed somewhat for conciseness and to show an alterna- 
tive mode of expression, but the logical content of each postulate is 
left unaltered. 

The formal subject matter of the theory 

The states, a set S of elements s, s′, ... with subsets A, B, C, ... (page 11).

The consequences, a set F of elements f, g, h, ... (page 14).

Acts, arbitrary functions f, g, h, ... from S to F (page 14).

The relation “is not preferred to” between acts, ≤ (page 18).


The postulates, and definitions on which they depend

Definitions of terms not in general mathematical use are given here
as D1 through D5; for others consult the General Index (page 289)
and the Technical Symbols (page 283).

P1  The relation ≤ is a simple ordering (page 18).

D1  f ≤ g given B, if and only if f′ ≤ g′ for every f′ and g′ that
agree with f and g, respectively, on B and with each other on ~B
(page 22).

P2  For every f, g, and B, f ≤ g given B or g ≤ f given B (page 23).

D2  g ≤ g′; if and only if f ≤ f′, when f(s) = g, f′(s) = g′ for every
s ∈ S (page 25).

D3  B is null, if and only if f ≤ g given B for every f, g (page 24).

P3  If f(s) = g, f′(s) = g′ for every s ∈ B, and B is not null; then
f ≤ f′ given B, if and only if g ≤ g′ (page 26).

D4  A ≤ B; if and only if f_A ≤ f_B or g ≤ g′ for every f_A, f_B, g, g′
such that: f_A(s) = g for s ∈ A, f_A(s) = g′ for s ∈ ~A, f_B(s) = g for
s ∈ B, f_B(s) = g′ for s ∈ ~B (page 31).

P4  For every A, B, A ≤ B or B ≤ A (page 31).

P5  It is false that, for every f, f′, f ≤ f′ (page 31).

P6  Suppose it false that g ≤ h; then, for every f, there is a (finite)
partition of S such that, if g′ agrees with g and h′ agrees with h except
on an arbitrary element of the partition, g′ and h′ being equal to f
there, then it will be false that g′ ≤ h or g ≤ h′ (page 39).

D5  f ≤ g given B (g ≤ f given B); if and only if f ≤ h given B
(h ≤ f given B), when h(s) = g for every s (page 72).

P7  If f ≤ g(s) given B (g(s) ≤ f given B) for every s ∈ B, then
f ≤ g given B (g ≤ f given B) (page 77).
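
The definitions D1 and D3 and the postulates P1 and P2 can be checked mechanically on a small finite model. The sketch below is only an illustration under stated assumptions: the three states, the two consequences, the probabilities, the utilities, and the expected-utility ordering are all hypothetical choices made for the example, not part of the postulates. A finite set of states cannot satisfy P6, so nothing beyond P1, P2, D1, and D3 is illustrated.

```python
from itertools import combinations, product

# Hypothetical toy world: three states and two consequences.
STATES = ("s1", "s2", "s3")
CONSEQUENCES = ("bad", "good")

# Assumed numbers, not from the text: personal probabilities and utilities.
PROB = {"s1": 0.5, "s2": 0.5, "s3": 0.0}
UTIL = {"bad": 0.0, "good": 1.0}

# An act assigns a consequence to every state; store it as a tuple aligned with STATES.
ACTS = list(product(CONSEQUENCES, repeat=len(STATES)))

# Every event (subset of S), including the vacuous and the universal event.
EVENTS = [frozenset(c) for r in range(len(STATES) + 1)
          for c in combinations(STATES, r)]


def expected_utility(act):
    return sum(PROB[s] * UTIL[c] for s, c in zip(STATES, act))


def not_preferred(f, g):
    """f <= g, read 'f is not preferred to g' (assumed expected-utility ordering)."""
    return expected_utility(f) <= expected_utility(g)


def agree_on(a, b, event):
    """True when acts a and b have the same consequence in every state of `event`."""
    return all(x == y for s, x, y in zip(STATES, a, b) if s in event)


def not_preferred_given(f, g, event):
    """D1: f' <= g' for all f', g' agreeing with f, g on `event` and with each other off it."""
    complement = frozenset(STATES) - event
    return all(
        not_preferred(f2, g2)
        for f2 in ACTS if agree_on(f2, f, event)
        for g2 in ACTS if agree_on(g2, g, event) and agree_on(g2, f2, complement)
    )


def is_null(event):
    """D3: B is null when f <= g given B for every pair of acts."""
    return all(not_preferred_given(f, g, event) for f in ACTS for g in ACTS)


# P1: the relation is a simple ordering (every pair comparable, and transitive).
p1_holds = (
    all(not_preferred(f, g) or not_preferred(g, f) for f in ACTS for g in ACTS)
    and all(not_preferred(f, h)
            for f in ACTS for g in ACTS for h in ACTS
            if not_preferred(f, g) and not_preferred(g, h))
)

# P2: for every f, g, and B, f <= g given B or g <= f given B.
p2_holds = all(
    not_preferred_given(f, g, b) or not_preferred_given(g, f, b)
    for f in ACTS for g in ACTS for b in EVENTS
)

null_events = [sorted(b) for b in EVENTS if is_null(b)]
print(p1_holds, p2_holds, null_events)
```

In this toy model the only null events are the vacuous event and the event consisting of the state that was assigned probability zero, which is the behavior D3 is meant to capture.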

Preface


A BOOK ABOUT SO CONTROVERSIAL A SUBJECT AS THE FOUNDATIONS 
of statistics may have some value in the classroom, as I hope this one 
will; but it cannot be a textbook, or manual of instruction, stating the 
accepted facts about its subject, for there scarcely are any. Openly, or 
coyly screened behind the polite conventions of what we call a disinter- 
ested approach, it must, even more than other books, be an airing of 
its author’s current opinions. 

One who so airs his opinions has serious misgivings that (as may be 
judged from other prefaces) he often tries to communicate along with 
his book. First, he longs to know, for reasons that are not altogether 
noble, whether he is really making a valuable contribution. His own 
conceit, the encouragement of friends, and the confidence of his pub- 
lisher have given him hope, but he knows that the hopes of others in 
his position have seldom been fully realized. 

Again, what he has written is far from perfect, even to his biased 
eye. He has stopped revising and called the book finished, because 
one must sooner or later. 

Finally, he fears that he himself, and still more such public as he 
has, will forget that the book is tentative, that an author’s most recent 
word need not be his last word. 

The application of statistics interests some workers in almost every 
field of empirical investigation—not only in science, but also in com- 
merce and industry. Moreover, the foundations of statistics are con- 
nected conceptually with many disciplines outside of statistics itself, 
particularly mathematics, philosophy, economics, and psychology—a 
situation that, incidentally, must augment the natural misgivings of 
an author in this field about his own competence. Those who read in 
this book may, therefore, be diverse in background and interests. With 
this consideration in mind, I have endeavored to keep the book as free 
from technical prerequisites as its subject matter and its restriction to 
a reasonable size permit. 

Technical knowledge of statistics is nowhere assumed, but the reader 
who has some general knowledge of statistics will be much better pre- 
pared to understand and appraise this book. The books Statistics, by 
L. H. C. Tippett, and On the Principles of Statistical Inference by 
A. Wald, listed in the Bibliography at the end of Appendix 3, are short
authoritative introductions to statistics, either of which would provide 
some statistical background for this book. The books of Tippett and 
Wald are so different in tone and emphasis that it would by no means 
be wasteful to read them both, in that order. 

Any but the most casual reader should have some formal preparation 
in the theory of mathematical probability. Those acquainted with 
moderately advanced theoretical statistics will automatically have this 
preparation; others may acquire it, for example, by reading Theory of 
Probability, by M. E. Munroe, or selected parts of An Introduction to 
Probability Theory and Its Applications, by W. Feller, according to 
their taste. In Feller’s book, a thorough reading of the Introduction 
and Chapter 1, and a casual reading of Chapters 5, 7, and 8 would be 
sufficient. 

The explicit mathematical prerequisites are not great; a year of cal- 
culus would in principle be more than enough. But, in practice, read- 
ers without some training in formal logic or one of the abstract branches 
of mathematics usually taught only after calculus will, I fear, find some 
of the long though elementary mathematical deductions quite forbid- 
ding. For the sake of such readers, I therefore take the liberty of giv- 
ing some pedagogical advice here and elsewhere that mathematically 
more mature readers will find superfluous and possibly irritating. In 
the first place, it cannot be too strongly emphasized that a long mathe- 
matical argument can be fully understood on first reading only when it 
is very elementary indeed, relative to the reader’s mathematical knowl- 
edge. If one wants only the gist of it, he may read such material once 
only; but otherwise he must expect to read it at least once again. Seri- 
ous reading of mathematics is best done sitting bolt upright on a hard
chair at a desk. Pencil and paper are nearly indispensable; for there
are always figures to be sketched and 
fied by calculation. In this book, a 
when exercises are indicated 


be general intelligibility and 
no real difficulty in re-

Few will wish to read the whole book; therefore introductions to the
chapters and sections have been so written as not only to provide orien- 
tation but also to facilitate skipping. In particular, safe detours are 
indicated around mathematically advanced topics and other digressions. 

A few words in explanation of the conventions, such as those by which 
internal and external references are made in this book, may be useful. 

The abbreviation § 3.4 means Section 4 of Chapter 3; within Chapter 
3 itself, this would be abbreviated still further to § 4. The abbreviation 
(3.4.1) means the first numbered and displayed equation or other ex- 
pression in § 3.4; within Chapter 3, this would be abbreviated still 
further to (4.1) and within § 3.4 simply to (1). Theorems, lemmas, 
exercises, corollaries, figures, and tables are named by a similar system, 
e.g., Theorem 3.4.1, Theorem 4.1, Theorem 1. Incidentally, the proofs 
of theorems are terminated with the special punctuation mark ◆, a
device borrowed from Halmos’ Measure Theory. 
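
The abbreviation rule for internal references can be stated as a small function. This is a minimal sketch; the function name `abbreviate` and its parameters are hypothetical and serve only to restate the convention described above.

```python
def abbreviate(ref, chapter=None, section=None):
    """Shorten a dotted reference such as '3.4.1' relative to the reader's location.

    `chapter` and `section` give the place from which the reference is made;
    leave them as None when citing from outside the chapter or section.
    """
    parts = ref.split(".")
    if chapter is not None and parts[0] == str(chapter):
        parts = parts[1:]  # inside the same chapter: drop the chapter number
        # Inside the same section, numbered items also drop the section number.
        if section is not None and len(parts) > 1 and parts[0] == str(section):
            parts = parts[1:]
    return ".".join(parts)


print(abbreviate("3.4"))                           # cited from another chapter
print(abbreviate("3.4", chapter=3))                # cited within Chapter 3
print(abbreviate("3.4.1", chapter=3))              # equation cited within Chapter 3
print(abbreviate("3.4.1", chapter=3, section=4))   # equation cited within § 3.4
```

The same rule covers theorems, lemmas, exercises, corollaries, figures, and tables, since they are numbered by the same system.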

Seven postulates, P1, P2, etc., are introduced over the course of 
several chapters. For ready reference these are, with some explanatory 
material, reproduced on the end papers. 

Entries in the Bibliography at the end of Appendix 3 are designated 
by a self-explanatory notation in square brackets. For example, the 
works of Tippett, Wald, Munroe, Feller, and Halmos, already referred
to, are [T2], [W1], [M6], [F1], and [H2], respectively. 

I often allude to a set of key references to a given topic. This means 
a set of external references intended to lead the reader that wishes to 
pursue that particular topic to the fullest and most recent bibliographies; 
it has nothing to do with the merit or importance of the works referred to. 

Technical terms (except for non-verbal symbols) that are defined in 
this book are printed in bold face or italics (depending on the impor- 
tance of the term for this book or for established usage) in the context 
where the term is defined. These special fonts are occasionally used 
for other purposes as well. Terms are sometimes used informally— 
even in unofficial definitions—before being officially defined. Even the 
official definitions are sometimes of necessity very loose, corresponding 
to the well-known principle that, in a formal theory, some terms must 
in strict logic be left undefined. 

L. J. SAVAGE 

University of Chicago 

April, 1954 

CHAPTER 1


Introduction 


1 The role of foundations 

It is often argued academically that no science can be more secure 
than its foundations, and that, if there is controversy about the foun- 
dations, there must be even greater controversy about the higher parts 
of the science. As a matter of fact, the foundations are the most con- 
troversial parts of many, if not all, sciences. Physics and pure mathe- 
matics are excellent examples of this phenomenon. As for statistics, 
the foundations include, on any interpretation of which I have ever 
heard, the foundations of probability, as controversial a subject as one 
could name. As in other sciences, controversies over the foundations
of statistics reflect themselves to some extent in everyday practice, but 
not nearly so catastrophically as one might imagine. I believe that 
here, as elsewhere, catastrophe is avoided, primarily because in prac- 
tical situations common sense generally saves all but the most pedantic 
of us from flagrant error. It is hard to judge, however, to what extent 
the relative calm of modern statistics is due to its domination by a 
vigorous school relatively well agreed within itself about the foundations. 

Although study of the foundations of a science does not have the 
role that would be assigned to it by naive first-things-firstism, it has a 
certain continuing importance as the science develops, influencing, and 
being influenced by, the more immediately practical parts of the science. 


2 Historical background 

The concept and problem of inductive inference have been promi-
nent in philosophy at least since Aristotle. Mathematical work on some
aspects of the problem of inference dates back at least to the early 
eighteenth century. Leibniz is said to be the first to publish a sugges- 
tion in that direction, but Jacob Bernoulli’s posthumous Ars Conjec-
tandi (1713) [B12] seems to be the first concerted effort.† This mathe-
matical work has always revolved around the concept of probability;
but, though there was active interest in probability for nearly a cen-
tury before the publication of Ars Conjectandi, earlier activity seems
not to have been concerned with inductive inference.

† Valuable information on this and other topics of the early philosophic history of
probability is attractively presented in Keynes’ treatise [K4], especially in Chapters
VII, XXIII, and the bibliography.

In the present century there has been and continues to be extra- 
ordinary interest in mathematical treatment of problems of inductive 
inference. For reasons I cannot and need not analyze here, this ac- 
tivity has been strikingly concentrated in the English-speaking world. 
It is known under several names, most of which stress some aspect of 
the subject that seemed of overwhelming importance at the moment 
when the name was coined. “Mathematical statistics,” one of its 
earliest names, is still the most popular. In this name, “mathematical” 
seems to be intended to connote rational, theoretical, or perhaps mathe- 
matically advanced, to distinguish the subject from those problems of 
gathering and condensing numerical data that can be considered apart 
from the problem of inductive inference, the mathematical treatment 
of which is generally relatively trivial. The name “statistical inference” 
recognizes that the subject is concerned with inductive inference. The 
name “statistical decision” reflects the idea that inductive inference is 
not always, if ever, concerned with what to believe in the face of in- 
conclusive evidence, but that at least sometimes it is concerned with 
what action to decide upon under such circumstances. Within this 
book, there will be no harm in adopting the shortest possible name, 
“statistics.” 

It is unanimously agreed that statistics depends somehow on proba-
bility. But, as to what probability is and how it is connected with
statistics, there has seldom been such complete disagreement and break-
down of communication since the Tower of Babel. There must be
dozens of different interpretations of probability defended by living
authorities, and some authorities hold that several different interpreta-
tions may be useful, that is, that the concept of probability may have
different meaningful senses in different contexts. Doubtless, much of
the disagreement is merely terminological and would disappear under
sufficiently sharp analysis. Some believe that it would all disappear,
or even that they have themselves already made the necessary
analysis.

Considering the confusion about the foundations of statistics, it is
surprising, and certainly gratifying, to find that almost everyone is
agreed on what the purely mathematical properties of probability are.
Virtually all controversy therefore centers on questions of interpreting
the generally accepted axiomatic concept of probability, that is, of de-
termining the extramathematical properties of probability.

The widely accepted axiomatic concept referred to is commonly as-
cribed to Kolmogoroff [K7] and goes by his name. It should be men- 
tioned that there is some dissension from it on the part of a small group 
led by von Mises [V2]. There are also a few minor technical variations 
on the Kolmogoroff system that are sometimes of interest; they will be 
discussed in § 3.4. 

I would distinguish three main classes of views on the interpretation 
of probability, for the purposes of this book, calling them objectivistic, 
personalistic, and necessary. Condensed descriptions of these three 
classes of views seem called for here. If some readers find these descrip- 
tions condensed to the point of unintelligibility, let them be assured 
that fuller ones will gradually be developed as the book proceeds. 

Objectivistic views hold that some repetitive events, such as tosses 
of a penny, prove to be in reasonably close agreement with the mathe- 
matical concept of independently repeated random events, all with the 
same probability. According to such views, evidence for the quality 
of agreement between the behavior of the repetitive event and the 
mathematical concept, and for the magnitude of the probability that 
applies (in case any does), is to be obtained by observation of some 
repetitions of the event, and from no other source whatsoever. 

Personalistic views hold that probability measures the confidence 
that a particular individual has in the truth of a particular proposition, 
for example, the proposition that it will rain tomorrow. These views 
postulate that the individual concerned is in some ways “reasonable,” 
but they do not deny the possibility that two reasonable individuals 
faced with the same evidence may have different degrees of confidence 
in the truth of the same proposition.

Necessary views hold that probability measures the extent to which 
one set of propositions, out of logical necessity and apart from human 
opinion, confirms the truth of another. They are generally regarded
by their holders as extensions of logic, which tells when one set of prop-
ositions necessitates the truth of another.

After what has been said about the intensity and complexity of the 
controversy over the probability concept, you must realize that the
short taxonomy above is bound to infuriate any expert on the founda-
tions of probability, but I trust it may do the less learned more good
than harm.

The great burst of statistical research in the English-speaking world 
in the present century has revolved around objectivistic views on the 
interpretation of probability. As will shortly be explained, any purely 
objectivistic view entails a severe difficulty for statistics. This diffi- 
culty is recognized by members of the British-American School, if I 
may use that name without its being taken too literally or at all na-
tionalistically, and is regarded by them as a great, though not insur- 
mountable, obstacle; indeed, some of them see it as the central problem 
of statistics. 

The difficulty in the objectivistic position is this. In any objecti-
vistic view, probabilities can apply fruitfully only to repetitive events,
that is, to certain processes; and (depending on the view in question)
it is either meaningless to talk about the probability that a given propo-
sition is true, or this probability can be only 1 or 0, according as the
proposition is in fact true or false. Under neither interpretation can
probability serve as a measure of the trust to be put in the proposition.
Thus the existence of evidence for a proposition can never, on an ob-
jectivistic view, be expressed by saying that the proposition is true with
a certain probability. Again, if one must choose among several courses 
of action in the light of experimental evidence, it is not meaningful, in 
terms of objective probability, to compute which of these actions is 
most promising, that is, which has the highest expected income. Hold- 
ers of objectivistic views have, therefore, no recourse but to argue that 
it is not reasonable to assign probabilities to the truth of propositions 
or to calculate which of several actions is the most promising, and that
the need expressed by the attempt to set up such concepts must be
met in other ways, if at all.

is presented in a tentative spirit, for I realize that the serious blemishes
in it apparent to me are not the only ones that will be discovered by 
critical readers. A theory of the foundations of statistics that appears 
contrary to the teaching of the most productive statisticians will prop- 
erly be regarded with extraordinary caution. Other views on proba- 
bility will, of course, be discussed in this book, partly for their own in- 
terest and partly to explain the relationship between the personalistic 
view on which this book is based and other views. 

The book is organized into seventeen chapters, of which the present 
introduction is the first. Chapters 2-7 are, so to speak, concerned with 
the foundations at a relatively deep level. They develop, explain, and 
defend a certain abstract theory of the behavior of a highly idealized 
person faced with uncertainty. That theory is shown to have as im- 
plications a theory of personal probability, corresponding to the per- 
sonalistic view of probability basic to this book, and also a theory of 
utility due, in its modern form, to von Neumann and Morgenstern 
[V4].

There is a transition, occurring in Chapter 8 and maintained through- 
out the rest of the book, to a shallower level of the foundations of sta- 
tistics; I might say from pre-statistics to statistics proper. In those 
later chapters, it is recognized that the theory developed in the earlier 
ones is too highly idealized for immediate application. Some compro- 
mises have to be made, and the appropriate ones are sought in an anal- 
ysis of some of the inventions and ideas of the British-American School. 
It will, I hope, be demonstrated thereby that the superficially incom- 
patible systems of ideas associated on the one hand with a personalistic 
view of probability and on the other with the objectivistically inspired 
developments of the British-American School do in fact lend each other
mutual support and clarification.


CHAPTER 2 


Preliminary Considerations
on Decision in
the Face of Uncertainty


1 Introduction 


Decisions made in the face of uncertainty pervade the life of every 
individual and organization. Even animals might be said continually 
to make such decisions, and the psychological mechanisms by which 
men decide may have much in common with those by which animals 
do so. But formal reasoning presumably plays no role in the decisions 
of animals, little in those of children, and less than might be wished in 
those of men. It may be said to be the purpose of this book, and in- 
deed of statistics generally, to discuss the implications of reasoning for 
the making of decisions. 

Reasoning is commonly associated with logic, but it is obvious, as 
many have pointed out, that the implications of what is ordinarily 
called logic are meager indeed when uncertainty is to be faced. It has 
therefore often been asked whether logic cannot be extended, by prin- 
ciples as acceptable as those of logic itself, to bear more fully on un- 
certainty. An attempt to extend logic in this way will be begun in 
this chapter, differing in two important respects from most, but not 
all, other attempts. 

First, since logic is concerned with implications among propositions,
many have thought it natural to extend logic by setting up criteria for
the extent to which one proposition tends to imply, or provide evidence
for, another. It seems to me obvious, however, that what is ultimately
wanted is criteria for deciding among possible courses of action; and,
therefore, generalization of the relation of implication seems at best a
roundabout method of attack. It must be admitted that logic itself
does lead to some criteria for decision, because what is implied by a
proposition known to be true is sometimes relevant to
making a decision. Should some notion of partial implication be de-
monstrably even better articulated with decision than is implication it-
self, that would be excellent; but how is such a notion to be sought ex-
cept by explicitly studying decision? Ramsey’s discussion in [R1] of
the point at issue here is especially forceful. 

Second, it is appealing to suppose that, if two individuals in the same 
situation, having the same tastes and supplied with the same informa- 
tion, act reasonably, they will act in the same way. Such agreement, 
belief in which amounts to a necessary (as opposed to a personalistic) 
view of probability, is certainly worth looking for. Personally, I be- 
lieve that it does not correspond even roughly with reality, but, hav- 
ing at the moment no strong argument behind my pessimism on this 
point, I do not insist on it. But I do insist that, until the contrary be 
demonstrated, we must be prepared to find reasoning inadequate to 
bring about complete agreement. In particular, the extensions of logic 
to be adduced in this book will not bring about complete agreement; 
and whether enough additional principles to do so, or indeed any addi- 
tional principles of much consequence, can be adduced, I do not know. 
It may be, and indeed I believe, that there is an element in decision 
apart from taste, about which, like taste itself, there is no disputing. 

The next four sections of this chapter build up a formal model, or 
scheme, of the situation in which a person is faced with uncertainty; 
the final two, in terms of this model, motivate and state some of the 
few principles that seem to me entitled to be taken as postulates for
rational decision.


2 The person 

I am about to build up a highly idealized theory of the behavior of a 
“rational” person with respect to decisions. In doing so I will, of course, 
have to ask you to agree with me that such and such maxims of behavior 
are “rational.” In so far as “rational” means logical, there is no live 
question; and, if I ask your leave there at all, it is only as a matter of 
form.† But our person is going to have to make up his mind in situa-
tions in which criteria beyond the ordinary ones of logic will be neces-
sary. So, when certain maxims are presented for your consideration,
you must ask yourself whether you try to behave in accordance with
them, or, to put it differently, how you would react if you noticed your-
self violating them.

† The assumption that a person’s behavior is logical is, of course, far from vacuous.
In particular, such a person cannot be uncertain about decidable mathematical prop-
ositions. This suggests, at least to me, that the tempting program sketched by Polya
[P6] of establishing a theory of the probability of mathematical conjectures cannot
be fully successful in that it cannot lead to a truly formal theory, but de Finetti
[D5] seems more optimistic about the program.

It is brought out in economic theory that organizations sometimes
behave like individual people, so that a theory originally intended to 
apply to people may also apply to (or may even apply better to) such 
units as families, corporations, or nations. In view of this possibility, 
economic theorists are sometimes reluctant to use the word “person,” 
or even “individual,” for the behaving units to which they refer; but 
for our purpose “person” threatens no confusion, though the possi- 
bility of using it in an extended sense may well be borne in mind. 


3 The world, and states of the world 


A formal description, or model, of what the person is uncertain about 
will be needed. To motivate this formal description, let me begin in- 
formally by considering a list of examples. The person might be un- 
certain about: 

1. Whether a particular egg is rotten. 

2. Which, if any, in a particular dozen eggs are rotten. 

3. The temperature at noon in Chicago yesterday. 

4. What the temperature was and will be in the place now covered 
by Chicago each noon from January 1, 1 A.D., to January 1, 4000 A.D. 

5. The infinite sequence of heads and tails that will result from re- 
peated tosses of a particular (everlasting) coin. 

6. The complete decimal expansion of π.

7. The exact and entire past, present, and future history of the uni- 
verse, understood in any sense, however wide. 

These examples have a few features in common, though, if there are
more than a few, it is a discredit to my imagination. Thus, in each
there is some object about which the person is uncertain, an egg, a
dozen eggs, a temperature, a sequence of temperatures, etc. Each ob-
ject admits a certain class of descriptions that might thinkably apply
to it. To illustrate, the egg of Example 1 might be rotten or not; and
the terms of the example are meant to exclude any other description
from consideration, though, of course, a real egg has many other fea-
tures. Again, since any subset of the dozen eggs (including the extreme
cases of all and none at all) might be rotten, there are 2^12 descriptions
associated with Example 2. For Example 3 and each subsequent one,
there are an infinite number of descriptions, though the array of de-
scriptions is more complicated in some than in others, reaching the ulti-
mate of complexity in Example 7. Example 6 is a little anomalous
in that anything the person does not know about the description of π
he could know in principle by thinking sufficiently hard about it, that
is, by logic alone. This point, banal to some readers, needs explanation
for others. If, for example, π is understood to be the area of a circle of
unit radius, it follows by logic alone that π is not greater than the area
of a square circumscribing the unit circle, that is, π ≤ 4. By an elabo-
ration of this method π can be computed to any degree of accuracy,
and by other purely logical methods many other facts about π can be
established, such as the fact that π is not a rational number.
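
One way to elaborate the circumscribed-square bound is Archimedes' method of doubling the number of sides of inscribed and circumscribed regular polygons. The text does not say which elaboration is meant, so the sketch below is an assumption chosen for illustration; the function name `archimedes_bounds` is hypothetical. It starts from the bound π ≤ 4 and tightens it with arithmetic and square roots alone.

```python
import math


def archimedes_bounds(doublings):
    """Return (lower, upper) bounds on pi after doubling the polygon sides.

    upper: area of the circumscribed regular polygon, starting from the
           square of area 4 that circumscribes the unit circle.
    lower: half the perimeter of the inscribed regular polygon with the
           same number of sides, starting from the inscribed square.
    """
    upper = 4.0
    lower = 2.0 * math.sqrt(2.0)
    for _ in range(doublings):
        # Classical recurrences: harmonic mean, then geometric mean.
        upper = 2.0 * upper * lower / (upper + lower)
        lower = math.sqrt(upper * lower)
    return lower, upper


for k in (0, 2, 5, 10):
    low, high = archimedes_bounds(k)
    print(k, low, high)
```

Each doubling shrinks the gap between the two bounds by roughly a factor of four, which is the sense in which the accuracy can be pushed as far as one likes.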

In connection with the concepts suggested by the preceding para- 
graph, the following nomenclature is proposed as brief, suggestive, and 
in reasonable harmony with the usages of statistics and ordinary dis- 
course. 

Term                            Definition

the world                       the object about which the person is
                                concerned

a state (of the world)          a description of the world, leaving no
                                relevant aspect undescribed

the true state (of the world)   the state that does in fact obtain, i.e.,
                                the true description of the world


In application of the theory, the question will arise as to which world 
to use in a given context. Thus, if the person is interested in the only 
brown egg in a dozen, should that egg or the whole dozen be taken as 
the world? It will be seen as the theory is developed that in principle 
no harm is done by taking the larger of two worlds as a model of the 
situation. One is therefore tempted to adopt, once and for all, one 
world sufficiently large, say Example 7. The most serious objection to 
this is that Example 7 is vague, and some mathematical and philosophi- 
cal experience suggests that the vagueness cannot be removed without 
ruining the universality of the example. It may also be added that the 
use of modest little worlds, tailored to particular contexts, is often a 
simplification, the advantage of which is justified by a considerable
body of mathematical experience with related ideas. 

The sense in which the world of a dozen eggs is larger than the world 
of the one brown egg in the dozen is in some respects obvious. It may 
be well, however, to emphasize that a state of the smaller world corre-
sponds not to one state of the larger, but to a set of states. Thus,
“The brown egg is rotten” describes the smaller world completely, and
therefore is a state of it; but the same statement leaves much about the
larger world unsaid and corresponds to a set of 2^11 states of it. In the
sense under discussion a smaller world is derived from a larger by neg-
lecting some distinctions between states, not by ignoring some states
outright. The latter sort of contraction may be useful in case certain
states are regarded by the person as virtually impossible so that they
can be ignored. 
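
The correspondence between the two worlds can be made concrete by enumeration. In this minimal sketch a state of the larger world is a tuple of twelve truth values, one per egg, and the smaller world is obtained by neglecting everything except the brown egg. The index chosen for the brown egg and the names used are hypothetical.

```python
from itertools import product

N_EGGS = 12
BROWN = 0  # hypothetical position of the one brown egg in the dozen

# A state of the larger world: True (rotten) or False (good) for each egg.
large_world = list(product((False, True), repeat=N_EGGS))


def to_small_world(state):
    """Neglect every distinction except the condition of the brown egg."""
    return "rotten" if state[BROWN] else "good"


# Group the states of the larger world by the small-world state they map to.
preimage = {}
for state in large_world:
    preimage.setdefault(to_small_world(state), []).append(state)

print(len(large_world))          # number of states of the larger world
print(len(preimage["rotten"]))   # states matching "The brown egg is rotten"
```

No state of the larger world is discarded by this mapping; the states are only grouped, which is the distinction the paragraph draws between neglecting distinctions and ignoring states outright.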


4 Events 


An event is a set of states. For example, in connection with the 
world of Example 2, the person might well be concerned with the event 
that exactly one egg in the dozen is rotten (an event having 12 states 
as elements), or, a little less academically, that at least one of the eggs 
is rotten (an event having 2^12 − 1 states as elements, i.e., all the states
in the world but one). In connection with the world of Example 3, 
the person might be concerned with the event, having an infinite num- 
ber of states, that the temperature at noon in Chicago yesterday was 
below freezing. To give a final illustration, of a more mathematical 
flavor, consider in connection with Example 5 the event that the ratio 
of the number of heads to tails approaches 3 as the sequence progresses 
to infinity. 
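
For the finite world of Example 2 the two events mentioned above can be written out as sets of states and counted directly. The representation of a state as a tuple of twelve truth values is an assumption made for this sketch.

```python
from itertools import product

# Each state of the world of Example 2: True (rotten) or False (good) per egg.
states = list(product((False, True), repeat=12))

# An event is a set of states.
exactly_one_rotten = {s for s in states if sum(s) == 1}
at_least_one_rotten = {s for s in states if any(s)}

print(len(exactly_one_rotten))    # 12
print(len(at_least_one_rotten))   # 4095, i.e. every state but one
```

The single state left out of the second event is the one in which all twelve eggs are good.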

In connection with any given world, there are two events that are 
of the utmost logical importance, though in ordinary discourse it may 
seem banal even to mention their existence. These are the universal 
and the vacuous events. The universal event, here to be symbolized 
by S, is the event having every state of the world as element. In so 
far as “world” has a real technical meaning, S is the world. The vacu- 
ous event, which can here be safely enough symbolized by the 0 of 
arithmetic, is the event having no states as elements. To illustrate, in 
Example 1 the event that the egg is rotten or good is the universal 
event, and that it is both rotten and good is the vacuous event. 

It is important to be able to express the idea that a given event 
contains the true state among its elements. English usage seems to offer 
no alternative to the rather stuffy expression, “the event obtains.” 

The theory under development makes no formal reference to time. 
In particular, the concept of event as here formulated is timeless, though 
temporal ideas may be employed in the description of particular events. 
Thus, it would not be said that Lincoln’s assassination is an event that 
occurred in 1865 and that the next return of Halley’s comet is one that 
will occur in 1985, but that Lincoln’s assassination in 1865 and the 
return of Halley’s comet in, but not before, 1985 are events that 
obtain. 

Modern mathematical usage, especially that of a branch of mathematics 
called Boolean algebra, suggests the following table of definitions in 
connection with the concepts of state and event. Some of these are 
synonyms, others abbreviations, and still others new terms compounded 
out of old. 




TABLE 1. MATHEMATICAL NOMENCLATURE PERTAINING TO STATES AND EVENTS 

Term                   | Definition 

(Basic terms) 
set of states          | event 
A, B, C, ···           | generic symbols for events 
s, s′, s″, ···         | generic symbols for states 
S                      | the universal event 
0                      | the vacuous event 

(Relations) 
s ∈ A                  | s is an element of A, i.e., a state in A.† 
A ⊂ B (or B ⊃ A)       | A is contained in B, i.e., every element of A is 
                       | an element of B. 
A = B                  | A equals B, i.e., A is the same set as B, i.e., 
                       | A and B have exactly the same elements. 

(Constructs) 
the complement of A    | those elements of S that are not in A 
  with respect to S    | 
~A                     | the complement of A with respect to S 
the union of the Aᵢ’s  | those elements of S that are elements of at 
                       | least one of the sets A₁, A₂, etc. 
∪ᵢ Aᵢ                  | the union of the Aᵢ’s 
A ∪ B                  | the union of A and B, i.e., those elements of S 
                       | that are elements of A or B (possibly of both) 
the intersection of    | those elements of S that are elements of each 
  the Aᵢ’s             | of the sets A₁, A₂, etc. 
∩ᵢ Aᵢ                  | the intersection of the Aᵢ’s 
A ∩ B                  | the intersection of A and B, i.e., those elements 
                       | of S that are elements of both A and B 

† Typographical note: The Porson font of the Greek alphabet 
(α, β, γ, δ, ε, ζ, ···) is the one almost always printed, at least in 
America, when mathematical constants and variables are denoted by Greek 
letters. The symbol ε used in this and some other publications to denote 
“element of” is, however, the epsilon of the Vertical font 
(α, β, γ, δ, ε, ζ, ···). […] the special symbol ∈; and some use ε, […] 
resemblance to ∈. The latter usage entails either using ε for two 
different purposes or else changing fonts in mid alphabet when constants 
and variables are denoted by Greek letters. 

Though the notations introduced in Table 1 are very elementary 
and of great utility, they are not ordinarily taught except in 
connection with logic or relatively advanced mathematics. A set of 
exercises illustrating their use is therefore given below in the form of 
a numbered list of statements. These statements are true whatever the 
sets A, B, C may be. Mathematicians would for the most part verify them 
by translating them into English and appealing to common sense, though 
in complicated cases explicit use might be made of Exercise 9. Diagrams, 
called Venn diagrams, in which sets are symbolized by areas, 
as illustrated by Figure 1, are often suggestive. 


[Figure 1. Venn diagram, with the region ~(A ∪ B) labeled.] 


It is a remarkable and useful fact that any universally valid 
statement about sets remains so if, throughout, ∪ is interchanged with ∩, 
0 with S, and ⊂ with ⊃. The dual in this sense of each exercise should 
be studied along with the exercise itself. For example, the dual of 
Exercise 7 is: A ⊃ B, if and only if A = A ∪ B. Note that the first 
parts of Exercises 1 through 6 are dual to the second parts. 

It may be remarked that, if Exercises 1-6 are taken as axioms and 
7 as a definition, Exercises 8-21 and also the duality principle follow 
formally from them. For example, 10 can be proved thus: By 7, if 
A ∩ B is A, then A ⊂ B; but, by 1, A ∩ A is A; therefore A ⊂ A. 
Again, 8 can be proved, using 6, 3, 2, 1, 3, and 6 in that order, thus: 

    0 ∩ A = (A ∩ ~A) ∩ A 
          = (~A ∩ A) ∩ A 
          = ~A ∩ (A ∩ A) = ~A ∩ A = A ∩ ~A = 0. 

Such formal demonstration is fun and helps develop mathematical skill. 
In the present exercises the novice, however, should consider it as a 
possible supplement to, but not as a substitute for, demonstration by 
interpretation. 


If the exercises fail to render the notations familiar, it would be best 
to talk with someone to whom they are already familiar or failing that, 
to read in any elementary book where the subject is treated, for ex- 
ample, Chapter II, “The Boole-Schroeder Algebra,” in the text of 
Lewis and Langford [L7]. 


Exercises illustrating Boolean algebra 


1. A ∩ A = A = A ∪ A. 
2. (A ∩ B) ∩ C = A ∩ (B ∩ C); (A ∪ B) ∪ C = A ∪ (B ∪ C). 
   (These facts often render parentheses superfluous.) 
3. A ∩ B = B ∩ A; A ∪ B = B ∪ A. 
4. A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C); A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C). 
5. S ∩ A = A; 0 ∪ A = A. 
6. A ∩ (~A) = 0; A ∪ (~A) = S. 
7. A ⊂ B, if and only if A = A ∩ B. 
8. 0 ∩ A = 0. 
9. A = B, if and only if A ⊂ B and B ⊂ A. 
10. A ⊂ A. 
11. (A ∩ B) ⊂ A. 
12. If A ⊂ B, then (A ∩ C) ⊂ (B ∩ C), and (A ∪ C) ⊂ (B ∪ C). 
13. (A ∪ B) ⊂ C, if and only if A ⊂ C and B ⊂ C. 
14. 0 ⊂ A ⊂ S. 
15. A ∩ (A ∪ B) = A. 
16. ~(~A) = A. 
17. ~(A ∪ B) = (~A) ∩ (~B) (De Morgan’s theorem). 
18. ~0 = S. 
19. A ∩ (~A ∪ B) = A ∩ B. 
20. A ⊂ B, if and only if (~B) ⊂ (~A). 
21. A ⊂ B, if and only if A ∩ (~B) = 0. 
22. ~(∪ᵢ Aᵢ) = ∩ᵢ (~Aᵢ) (General De Morgan’s theorem). 
23. A ∪ (∩ᵢ Bᵢ) = ∩ᵢ (A ∪ Bᵢ). 
24. A ∩ (∩ᵢ Bᵢ) = ∩ᵢ (A ∩ Bᵢ). 
25. (∪ᵢ Aᵢ) ∪ (∪ⱼ Bⱼ) = ∪ᵢ,ⱼ (Aᵢ ∪ Bⱼ). 
26. (∩ᵢ Aᵢ) ∪ (∩ⱼ Bⱼ) = ∩ᵢ,ⱼ (Aᵢ ∪ Bⱼ). 
27. A ⊂ (∩ᵢ Bᵢ), if and only if A ⊂ Bᵢ for every i. 
28. (∩ᵢ Bᵢ) ⊂ Bⱼ ⊂ (∪ᵢ Bᵢ) for every j. 
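
A few of the exercises can also be checked mechanically. The sketch below is no substitute for demonstration by interpretation, and a check over one small world proves nothing in general; it merely tests Exercises 4, 7, 17, and 21, and the dual of Exercise 7, over every choice of events in an assumed four-state world. The four-state world and the helper names `subsets`, `comp`, and `holds_for_all` are assumptions made for illustration.

```python
from itertools import combinations, product

S = frozenset(range(4))  # a small universal event, assumed for illustration
EMPTY = frozenset()      # the vacuous event, written 0 in the text

def subsets(universe):
    """All events (subsets) of the given universe."""
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def comp(a):
    """Complement of a with respect to S, written ~A in the text."""
    return S - a

events = subsets(S)

def holds_for_all(statement, arity):
    """True if the statement holds for every choice of `arity` events."""
    return all(statement(*args) for args in product(events, repeat=arity))

# For frozensets, & is intersection, | is union, <= is "is contained in".
ex4 = holds_for_all(lambda a, b, c: a & (b | c) == (a & b) | (a & c), 3)
ex7 = holds_for_all(lambda a, b: (a <= b) == (a == a & b), 2)
ex7_dual = holds_for_all(lambda a, b: (a >= b) == (a == a | b), 2)
ex17 = holds_for_all(lambda a, b: comp(a | b) == comp(a) & comp(b), 2)
ex21 = holds_for_all(lambda a, b: (a <= b) == (a & comp(b) == EMPTY), 2)

print(ex4, ex7, ex7_dual, ex17, ex21)
```

The dual of Exercise 7 is obtained exactly as the duality principle prescribes, by exchanging the intersection for the union and reversing the inclusion.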


5 Consequences, acts, and decisions 


To say that a decision is to be made is to say that one of two or more 
acts is to be chosen, or decided on. In deciding on an act, account 
must be taken of the possible states of the world, and also of the 
consequences implicit in each act for each possible state of the world. 
A consequence is anything that may happen to the person. 

Consider an example. Your wife has just broken five good eggs into 
a bowl when you come in and volunteer to finish making the omelet. 
A sixth egg, which for some reason must either be used for the omelet 
or wasted altogether, lies unbroken beside the bowl. You must decide 
what to do with this unbroken egg. Perhaps it is not too great an 
oversimplification to say that you must decide among three acts only, 
namely, to break it into the bowl containing the other five, to break it 
into a saucer for inspection, or to throw it away without inspection. 
Depending on the state of the egg, each of these three acts will have 
some consequence of concern to you, say that indicated by Table 1. 


TABLE 1. AN EXAMPLE ILLUSTRATING ACTS, STATES, AND CONSEQUENCES 

                  |                     State 
Act               | Good                       | Rotten 

break into bowl   | six-egg omelet             | no omelet, and five good 
                  |                            | eggs destroyed 
break into saucer | six-egg omelet, and a      | five-egg omelet, and a 
                  | saucer to wash             | saucer to wash 
throw away        | five-egg omelet, and one   | five-egg omelet 
                  | good egg destroyed         | 


Even the little example concerning the omelet suggests how varied 
the things, or experiences, regarded as consequences can be. They 
might in general involve money, life, state of health, approval of friends, 
well-being of others, the will of God, or anything at all about which the 
person could possibly be concerned. Consequences might appropriately 
be called states of the person, as opposed to states of the world. They 
might also be referred to, with some extension of the economic notion 
of income, as the possible incomes of the person. In any one problem, 
the set of consequences envisaged will be denoted by F, and the 
individual consequences will be denoted by f, g, h, etc. In the omelet 
example, F consists of the six consequences tabulated in Table 1: six-egg 
omelet; no omelet, and five good eggs destroyed; etc. 

If two different acts had the same consequences in every state of the 
world, there would from the present point of view be no point in 
considering them two different acts at all. An act may therefore be 
identified with its possible consequences. Or, more formally, an act is a 
function attaching a consequence to each state of the world. The 
notation 𝐟 will be used to denote an act, that is, a function, attaching 
the consequence 𝐟(s) to the state s. The notation 𝐟 is logically a better 
name for a function than the more customary 𝐟(s) for exactly the same 
reason that the word “logarithm” is a better term for logarithm than 
“logarithm of x” would be. The notational distinction involved here is 
often justifiably neglected in mathematical work, but we will have 
special need to observe it, at least in connection with acts, as will soon 
be explained. When several acts are to be discussed at once, they may be 
denoted by different letters thus: 𝐟, 𝐠, 𝐡; by the use of primes thus: 𝐟, 
𝐟′, 𝐟″; or by subscripts thus: 𝐟₁, 𝐟₂. The set of all acts available in a 
given situation will be denoted by 𝐅 or a similar symbol. In the 
example of the omelet, 𝐅 has three acts as elements. If, for example, 𝐟 
denotes the first of the three acts listed in Table 1, then 𝐟 is defined 
thus: 

(1)     𝐟(good) = six-egg omelet; 
        𝐟(rotten) = no omelet, and five good eggs destroyed. 
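
The identification of an act with a function from states to consequences can be made concrete. The following is a minimal sketch, assuming that a Python dictionary stands in for the function; the variable names are illustrative only, and the consequences are those of the omelet table.

```python
# States of the world in the omelet example.
STATES = ("good", "rotten")

# Each act attaches one consequence to each state.
break_into_bowl = {
    "good": "six-egg omelet",
    "rotten": "no omelet, and five good eggs destroyed",
}
break_into_saucer = {
    "good": "six-egg omelet, and a saucer to wash",
    "rotten": "five-egg omelet, and a saucer to wash",
}
throw_away = {
    "good": "five-egg omelet, and one good egg destroyed",
    "rotten": "five-egg omelet",
}

# The set of available acts, and the set of consequences they can produce.
acts = {
    "break into bowl": break_into_bowl,
    "break into saucer": break_into_saucer,
    "throw away": throw_away,
}
consequences = {act[s] for act in acts.values() for s in STATES}

# Two acts with the same consequence in every state are the same act.
same_act = break_into_bowl == dict(break_into_bowl)

print(len(acts), len(consequences), same_act)
```

The dictionary itself plays the part of the act, while looking up a state in it yields a consequence, which mirrors the distinction between the function and its value.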


The argument might be raised that the formal description of decision 
that has thus been erected seems inadequate because a person may not 
know the consequences of the acts open to him in each state of the 
world. He might be so ignorant, for example, as not to be sure whether 
one rotten egg will spoil a six-egg omelet. But in that case nothing 
could be simpler than to admit that there are four states in the world 
corresponding to the two states of the egg and the two conceivable 
answers to the culinary question whether one bad egg will spoil a 
six-egg omelet. It seems to me obvious that this solution works in the 
greatest generality, though a thoroughgoing analysis might not be 
trivial. A reader interested in the technicalities of this point or that of 
the succeeding paragraph will find an extensive discussion of a similar 
problem in Chapter II of [V4], where von Neumann and Morgenstern 
discuss the reduction of a general game to its reduced form. 
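
The refinement just described amounts to taking a product of possibilities. The sketch below assumes that a state is a pair (condition of the egg, answer to the culinary question); the consequence attached to a rotten egg that does not spoil the omelet is hypothetical, invented here only so that the act is defined on all four states.

```python
from itertools import product

egg_conditions = ("good", "rotten")
one_bad_egg_spoils = (True, False)

# Four states: every combination of egg condition and culinary answer.
states = list(product(egg_conditions, one_bad_egg_spoils))

def break_into_bowl(state):
    """The act 'break into bowl' as a function on the refined states."""
    condition, spoils = state
    if condition == "good":
        return "six-egg omelet"
    if spoils:
        return "no omelet, and five good eggs destroyed"
    # Hypothetical consequence, not taken from the text.
    return "six-egg omelet containing one rotten egg"

for state in states:
    print(state, "->", break_into_bowl(state))
```

Ignorance about the consequences of an act is thus absorbed into ignorance about which state obtains.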

Again, the formal description might seem inadequate in that it does 
not provide explicitly for the possibility that one decision may lead to 
another. Thus, if the omelet should be spoiled by breaking a rotten 
egg into it, new questions might arise about what to substitute for 
breakfast and how to appease your justifiably furious wife. But, just 
as in the preceding paragraph an apparent shortcoming of the proposed 
mode of description was attributed to an incomplete analysis of the 
possible states, here I would say that the list of available acts envisaged 
in Table 1 is inadequate for the interpretation that has just been put 
on the problem. Where the single act “break into bowl” now stands, 
there should be several, such as: “break into bowl, and in case of 
disaster have toast,” “break into bowl, and in case of disaster take family 
to a neighboring restaurant for breakfast.” Appropriate consequences 
of these new acts can easily be imagined. 

As has just been suggested, what in the ordinary way of thinking 
might be regarded as a chain of decisions, one leading to the other in 
time, is in the formal description proposed here regarded as a single 
decision. To put it a little differently, it is proposed that the choice of a 
policy or plan be regarded as a single decision. This point of view, 
though not always in so explicit a form, has played a prominent role 
in the statistical advances of the present century. For example, the 
great majority of experimentalists, even today, suppose that the func- 
tion of statistics and of statisticians is to decide what conclusions to 
draw from data gathered in an experiment or other observational pro- 
gram. But statisticians hold it to be lacking in foresight to gather data 
without a view to the method of analysis to be employed, that is, they 
hold that the design and analysis of an experiment should be decided 
upon as an articulated whole. 

The point of view under discussion may be symbolized by the prov- 
erb, “Look before you leap,” and the one to which it is opposed by the 
proverb, “You can cross that bridge when you come to it.” When two 
proverbs conflict in this way, it is proverbially true that there is some 
truth in both of them, but rarely, if ever, can their common truth be 
captured by a single pat proverb. One must indeed look before he 
leaps, in so far as the looking is not unreasonably time-consuming and 
otherwise expensive; but there are innumerable bridges one cannot 
afford to cross, unless he happens to come to them. 

Carried to its logical extreme, the “Look before you leap” principle 
demands that one envisage every conceivable policy for the government 
of his whole life (at least from now on) in its most minute details, in 
the light of the vast number of unknown states of the world, and decide 
here and now on one policy. This is utterly ridiculous, not, as some 
might think, because there might later be cause for regret, if things 
did not turn out as had been anticipated, but because the task implied 
in making such a decision is not even remotely resembled by human 
possibility. It is even utterly beyond our power to plan a picnic or to 
play a game of chess in accordance with the principle, even when the 
world of states and the set of available acts to be envisaged are 
artificially reduced to the narrowest reasonable limits. 

Though the “Look before you leap” principle is preposterous if 
carried to extremes, I would none the less argue that it is the proper 
subject of our further discussion, because to cross one’s bridges when one 
comes to them means to attack relatively simple problems of decision 
by artificially confining attention to so small a world that the “Look 
before you leap” principle can be applied there. I am unable to 
formulate criteria for selecting these small worlds and indeed believe that 
their selection may be a matter of judgment and experience about which 
it is impossible to enunciate complete and sharply defined general 
principles, though something more will be said in this connection in § 5.5. 
On the other hand, it is an operation in which we all necessarily have 
much experience, and one in which there is in practice considerable 
agreement. 

In view of the “Look before you leap” principle, acts and decisions, 
like events, are timeless. The person decides “now” once for all; there 
is nothing for him to wait for, because his one decision provides for all 
contingencies. None the less, temporal modes of description, though 
translatable into atemporal ones, are often suggestive. Thus, there 
will be occasion to analyze and make frequent use of the idea of defer- 
ring a decision until an observation relevant to it has been made. 


6 The simple ordering of acts with respect to preference 


Of two acts f and g, it is possible that the person prefers f to g. 
Loosely speaking, this means that, if he were required to decide between 
f and g, no other acts being available, he would decide on f. 

This procedure for testing preference is not entirely adequate, if only 
because it fails to take account of, or even define, the possibility that 
the person may not really have any preference between f and g, 
regarding them as equivalent; in which case his choice of f should not be 
regarded as significant. If the person really does regard f and g as 
equivalent, that is, if he is indifferent between them, then, if f or g 
were modified by attaching an arbitrarily small bonus to its 
consequences in every state, the person’s decision would presumably be for 
whichever act was thus modified. This test for indifference does not 
provide an altogether satisfactory definition, since it begs the question 
to some extent by postulating in effect that the tester knows what 
constitutes a small bonus. Another attempted solution would be to say 
that the person knows by introspection whether he has decided 
haphazardly or in response to a definite feeling of preference. This sort of 
solution seems to me especially objectionable, because I think it of 
great importance that preference, and indifference, between f and g be 
determined, at least in principle, by decisions between acts and not by 
response to introspective questions. In spite of the difficulty of 
distinguishing between preference and indifference, I think enough has 
been said for us to proceed to a postulational treatment of them. 

[…] the person cannot simultaneously prefer f to g and g to f. In the 
formal treatment of the relationships of preference and indifference it 
will be technically convenient to work with the relation “is not preferred 
to.” Thus, rather than say that it is impossible that both f is preferred 
to g and g to f, I might say that, of any two acts f and g, f is not 
preferred to g or g is not preferred to f, possibly both. Again, the 
definition of preference suggests that, if f is not preferred to g, and g is 
not preferred to h, then it is impossible that f should be preferred to h. 

The two assumptions just made about the relation “is not preferred 
to” are sometimes expressed in ordinary mathematical usage by saying 
that the relation is a simple ordering among acts. Formally, a relation 
≤· among a set of elements x, y, z, ···, is called a simple ordering, in 
this book, if and only if for every x, y, and z: 


1. Either x ≤· y, or y ≤· x. 
2. If x ≤· y, and y ≤· z, then x ≤· z. 


Borrowing from arithmetic the suggestive abbreviation ≤ for the 
relation “is not preferred to,” the assumption that ≤ is a simple 
ordering can be expressed formally by a postulate, thus: 


P1 The relation ≤ is a simple ordering among acts. 


It is noteworthy that P1 makes no explicit reference to states of the 
world. Except possibly for mathematical refinements,† it seems to me 
that no additional postulates can be formulated without making such 
reference; at any rate none will be in this book. 

P1 by itself is not very rich in consequences, but one easily proved 
theorem following from it may be mentioned. 


THEOREM 1 If F is a finite set of acts, there exist f and h in F such 
that for all g in F 


f ≤ g ≤ h. 


Theorem 1 is especially relevant to application of the theory of 
decision, because I interpret the theory to imply that, if F is finite, the 
person will decide on an act h in F to which no other act in F is 
preferred, the existence of at least one such h being guaranteed by the 
theorem. 
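
The content of Theorem 1 can be illustrated by a single pass through a finite list of acts. The sketch below assumes that the relation “is not preferred to” is supplied as a function `not_preferred(a, b)` that really is a simple ordering; the numerical scores used to build it are hypothetical and serve only to guarantee that property.

```python
def extremes(acts, not_preferred):
    """Return (f, h) such that f <= g <= h for every g in acts.

    Correct only when not_preferred is a simple ordering (connected and
    transitive); transitivity is what justifies keeping a running extreme.
    """
    lowest = highest = acts[0]
    for g in acts[1:]:
        if not_preferred(g, lowest):
            lowest = g
        if not_preferred(highest, g):
            highest = g
    return lowest, highest

# Hypothetical scores; the theory itself assigns no numbers at this stage.
scores = {"break into bowl": 2, "break into saucer": 3, "throw away": 1}

def not_preferred(a, b):
    return scores[a] <= scores[b]

f, h = extremes(list(scores), not_preferred)
print(f, "|", h)
```

If the supplied relation were not transitive, the running extremes could be overtaken by acts already passed over, and no such f and h need exist.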


It is often appropriate to consider infinite sets of available acts. In 
economic contexts, for example, it is generally an inappropriate com- 
plication to take explicit account of the possibility that all transactions 
must be in integral numbers of pennies. If infinite sets of available acts 
are set up and interpreted without some mathematical tact, unrealistic 
conclusions are likely to follow. Suppose, for example, that you were 
free to choose any income, provided it be definitely less than $100,000 
per year. Precisely which income would you choose, abstracting from 
the indivisibility of pennies? 


† For example, such topological assumptions about the space with 
neighborhoods defined in terms of ≤ as connectedness, local compactness, 
or density. 

It is sometimes convenient to supplement the relation ≤ by other 
relations derived from it in accordance with the definitions in Table 1, 
analogous definitions being applicable to any simple ordering. The 
assumption of simple ordering, P1, has several implications for the 
derived relations ≥, <, >, and ≐. These are generally strongly 
suggested by the properties of the corresponding relations in arithmetic. 


TABLE 1. TABLE OF RELATIONS DERIVED FROM ≤ 

New Relation                              | Definition 

f ≥ g.                                    | g ≤ f. 
f < g, i.e., g is preferred to f.         | It is false that g ≤ f. 
f > g.                                    | g < f. 
f ≐ g, i.e., f is equivalent to (or       | f ≤ g, and g ≤ f. 
  indifferent with respect to) g.         | 
g is between f and h.                     | f ≤ g ≤ h or h ≤ g ≤ f. 

A few such implications of P1 are listed below, with no intention of 
completeness, as exercises for those who may not already be familiar 
with the elementary properties of simple ordering. 


Exercises 


1. The relation ≥ is also a simple ordering. 

2. All the relations ≤, ≥, <, >, and ≐ are transitive, that is, they 
can be validly substituted for ≤· in the second part of the definition of 
simple ordering. 

3. Between any pair of acts f, g, one and only one of the three 
relations <, ≐, and > holds. 

4. If f < g, and g ≐ h, then f < h. 

5. If f ≐ g, then g ≐ f. 

6. For any f, f ≐ f. 

7. At least one of three acts f, g, h is between the other two. When 
can there be more than one such? 
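
The derived relations can be written out directly from the basic relation and some of the exercises tested on a small example. This is only a sketch: the three acts and their ranks are hypothetical, chosen so that two acts are equivalent, and a check on one example is of course not a proof.

```python
from itertools import product

# Hypothetical ranks that induce a simple ordering on three acts.
rank = {"f": 1, "g": 1, "h": 2}

def le(a, b):            # a is not preferred to b
    return rank[a] <= rank[b]

def ge(a, b):            # defined as: b is not preferred to a
    return le(b, a)

def lt(a, b):            # b is preferred to a: it is false that b <= a
    return not le(b, a)

def gt(a, b):
    return lt(b, a)

def eq(a, b):            # a and b are equivalent
    return le(a, b) and le(b, a)

def between(g, f, h):    # g is between f and h
    return (le(f, g) and le(g, h)) or (le(h, g) and le(g, f))

acts = list(rank)

# Exercise 3: exactly one of <, equivalence, > holds for every pair.
ex3 = all(sum([lt(a, b), eq(a, b), gt(a, b)]) == 1
          for a, b in product(acts, repeat=2))

# Exercise 7: at least one of any three acts is between the other two.
ex7 = all(between(a, b, c) or between(b, a, c) or between(c, a, b)
          for a, b, c in product(acts, repeat=3))

print(ex3, ex7)
```

Only `le` is taken as given; every other relation is defined from it, exactly as in the table of derived relations.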
Two very different sorts of interpretations can be made of P1 and 
the other postulates to be adduced later. First, P1 can be regarded as 
a prediction about the behavior of people, or animals, in decision 
situations. Second, it can be regarded as a logic-like criterion of 
consistency in decision situations. For us, the second interpretation is the 
only one of direct relevance, but it may be fruitful to discuss both, 
calling the first empirical and the second normative. 

Logic itself admits an empirical as well as a normative 
interpretation. Thus, if an experimental subject believes certain propositions, 
it is to be expected that he will also believe their logical consequences 
and disbelieve the negations of these consequences. This theory of hu- 
man psychology has some validity and is of great practical utility in our 
everyday dealings with other people, though it is very crude and ap- 
proximate. For one thing, people often do make elementary mistakes 
in logic; more refined theories would attribute these mistakes to such 
things as accident or subconscious motivation. For another, if any- 
one who believed the axioms of mathematics also believed all that they 
imply and nothing that they contradict, mathematical study would be 
superfluous for him; such a person would, as has been explained, be 
able to state the ten-thousandth or any other term in the decimal 
expansion of π on demand. To summarize, logic can be interpreted as a 
crude but sometimes handy empirical psychological theory. 

The principal value of logic, however, is in connection with its norma- 
tive interpretation, that is, as a set of criteria by which to detect, with 
sufficient trouble, any inconsistencies there may be among our beliefs, 
and to derive from the beliefs we already hold such new ones as con- 
sistency demands. It does not seem appropriate here to attempt an 
analysis of why and in what contexts we wish to be consistent; it is 
sufficient to allude to the fact that we often do wish to be so. 

Analogously, P1 together with the postulates to be adduced later can 
be interpreted as a crude and shallow empirical theory predicting the 
[…]. This theory is practical in suitably […] toward it is much like 
[…] calling the departure a mistake and […] accident and subconscious 
motivation. […] becoming increasingly complicated as […] stand 
beside P1. 

[…] logic, the main use I would make of P1 […] to police my own 
decisions for consistency […] to make complicated decisions depend on 
simpler ones. 

Here it is more pertinent than it was in connection with logic that 
something be said of why and when consistency is a desideratum, though 
I cannot say much. Suppose someone says to me, “I am a rational 
person, that is to say, I seldom, if ever, make mistakes in logic. But I 
behave in flagrant disagreement with your postulates, because they 
violate my personal taste, and it seems to me more sensible to cater to my 
taste than to a theory arbitrarily concocted by you.” I don’t see how 
I could really controvert him, but I would be inclined to match his in- 
trospection with some of my own. I would, in particular, tell him that, 
when it is explicitly brought to my attention that I have shown a pref- 
erence for f as compared with g, for g as compared with h, and for h as 
compared with f, I feel uncomfortable in much the same way that I do 
when it is brought to my attention that some of my beliefs are logically 
contradictory. Whenever I examine such a triple of preferences on my 
own part, I find that it is not at all difficult to reverse one of them. In 
fact, I find on contemplating the three alleged preferences side by side 
that at least one among them is not a preference at all, at any rate not 
any more. 

There is some temptation to explore the possibilities of analyzing 
preference among acts as a partial ordering, that is, in effect to replace 
part 1 of the definition of simple ordering by the very weak proposition 
f ≤ f, admitting that some pairs of acts are incomparable. This would 
seem to give expression to introspective sensations of indecision or 
vacillation, which we may be reluctant to identify with indifference. My 
own conjecture is that it would prove a blind alley losing much in power 
and advancing little, if at all, in realism; but only an enthusiastic 
exploration could shed real light on the question. 

7 The sure-thing principle 


A businessman contemplates buying a certain piece of property. He 
considers the outcome of the next presidential election relevant to the 
attractiveness of the purchase. So, to clarify the matter for himself, 
he asks whether he would buy if he knew that the Republican candidate 
were going to win, and decides that he would do so. Similarly, he 
considers whether he would buy if he knew that the Democratic candidate 
were going to win, and again finds that he would do so. Seeing that he 
would buy in either event, he decides that he should buy, even though 
he does not know which event obtains, or will obtain, as we would 
ordinarily say. It is all too seldom that a decision can be arrived at on the 
basis of the principle used by this businessman, but, except possibly 
for the assumption of simple ordering, I know of no other extralogical 
principle governing decisions that finds such ready acceptance. 

Having suggested what I shall tentatively call the sure-thing 
principle, let me give it relatively formal statement thus: If the person 
would not prefer f to g, either knowing that the event B obtained, or 
knowing that the event ~B obtained, then he does not prefer f to g. 
Moreover (provided he does not regard B as virtually impossible) if he 
would definitely prefer g to f, knowing that B obtained, and, if he would 
not prefer f to g, knowing that B did not obtain, then he definitely 
prefers g to f. 

The sure-thing principle cannot appropriately be accepted as a postu- 
late in the sense that P1 is, because it would introduce new undefined 
technical terms referring to knowledge and possibility that would ren- 
der it mathematically useless without still more postulates governing 
these terms. It will be preferable to regard the principle as a loose one 
that suggests certain formal postulates well articulated with P1. 

What technical interpretation can be attached to the idea that f 
would be preferred to g, if B were known to obtain? Under any rea- 
sonable interpretation, the matter would seem not to depend on the 
values f and g assume at states outside of B. There is, then, no loss 
of generality in supposing that f and g agree with each other except in 
B, that is, that f(s) = g(s) for all s ∈ ~B. Under this unrestrictive 
assumption, f and g are surely to be regarded as equivalent given ~B; 
that is, they would be considered equivalent, if it were known that B 
did not obtain. The first part of the sure-thing principle can now be 
interpreted thus: If, after being modified so as to agree with one an- 
other outside of B, f is not preferred to g; then f would not be preferred 
to g, if B were known. The notion will be expressed formally by 
saying that f ≤ g given B. 

It is implicit in the argument that has just led to the definition of 
f ≤ g given B that, if two acts f and g are so modified in ~B as to agree 
with each other, then the order of preference obtaining between the 
modified acts will not depend on which of the permitted modifications 
was actually carried out. Equivalently, if f and g are two acts that do 
agree with each other in ~B, and f ≤ g; then, if f and g are modified 
in ~B in any way such that the modified acts f′ and g′ continue to 
agree with each other in ~B, it will also be so that f′ ≤ g′. This 
assumption is made formally in the postulate P2 below and illustrated 
schematically in Figure 1, a kind of diagram […] suggestive in 
such contexts. 

In Figure 1, the set S of all states s and the set F of all consequences 
f are represented by horizontal and vertical intervals respectively. In 
any such diagram an act f, being a function attaching a value f(s) ∈ F 
to each s ∈ S, is represented by a graph. This particular diagram graphs 
two acts f and g that agree with each other in ~B, and two other acts 
f′ and g′ that also agree with each other in ~B and arise by modifying 
f and g respectively only in ~B, that is, acts agreeing with f and g 
respectively in B. 


[Figure 1. Graphs of the acts f, g, f′, and g′ against the states: on B, 
f coincides with f′ and g with g′; on ~B, f coincides with g and f′ 
with g′.] 


P2 If f, g, and f′, g′ are such that: 
1. in ~B, f agrees with g, and f′ agrees with g′, 
2. in B, f agrees with f′, and g agrees with g′, 
3. f ≤ g; 
then f′ ≤ g′. 
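
The definition of preference given B, and the independence that P2 asserts, can be tried out on a toy model. Everything numerical below is an assumption made for illustration: the four states, the equal personal probabilities, the money-valued consequences, and the use of expected value as the ordering are not supplied by the theory at this stage. The sketch only shows that, for such an ordering, the verdict does not depend on how f and g are made to agree outside B.

```python
from fractions import Fraction
from itertools import product

STATES = ("s1", "s2", "s3", "s4")
B = {"s1", "s2"}
# Assumed personal probabilities; not part of the theory at this stage.
PROB = {s: Fraction(1, 4) for s in STATES}

def value(act):
    return sum(PROB[s] * act[s] for s in STATES)

def le(f, g):
    """f is not preferred to g, under the assumed expected-value ordering."""
    return value(f) <= value(g)

def modify(act, event, common):
    """An act agreeing with `act` on `event` and with `common` elsewhere."""
    return {s: act[s] if s in event else common[s] for s in STATES}

def le_given(f, g, event, payoffs=(0, 5, 10)):
    """f <= g given `event`: compare f and g after making them agree off it."""
    outside = [s for s in STATES if s not in event]
    verdicts = set()
    for values in product(payoffs, repeat=len(outside)):
        common = dict(zip(outside, values))
        verdicts.add(le(modify(f, event, common), modify(g, event, common)))
    # P2: the verdict must not depend on which modification was used.
    assert len(verdicts) == 1
    return verdicts.pop()

f = {"s1": 4, "s2": 0, "s3": 9, "s4": 9}
g = {"s1": 1, "s2": 5, "s3": 0, "s4": 0}

print(le(f, g), le_given(f, g, B))
```

Here f is preferred to g outright, because of what happens outside B, and yet f is not preferred to g given B.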


Each of the relations “≤ given B” is now easily seen to be a simple 
ordering, and the relations “≥, <, >, ≐ given B” are to be defined 
mutatis mutandis. It is noteworthy though obvious that, if f(s) = g(s) 
for all s ∈ B, then f ≐ g given B. 

It is now possible and instructive to give an atemporal analysis of 
the following temporally described decision situation: The person must 
decide between f and g after he finds out, that is, observes, whether B 
obtains; what will his decision be if he finds out that B does in fact 
obtain? 

Atemporally, the person can submit himself to the consequences of 
f or else of g for all s ∈ B, and, independently, he can submit himself to 
the consequences of f or else of g for all s ∈ ~B; which alternative will 
he decide upon for the s’s in B? 

Finally, describing the situation not only atemporally but also quite 
formally, the person must decide among four acts defined thus: 


h₀₀ agrees with f on B and with f on ~B, 
h₀₁ agrees with f on B and with g on ~B, 
h₁₀ agrees with g on B and with f on ~B, 
h₁₁ agrees with g on B and with g on ~B. 


The question at issue now takes this form. Supposing that none of 
the four functions is preferred to the particular one hᵢⱼ, is i = 0, or is 
i = 1; that is, does hᵢⱼ agree with f on B or with g on B? 

It is not hard to see that i can be 1, if and only if f ≤ g given B. 
Indeed, if i = 1, h₀ⱼ ≤ h₁ⱼ, which means that f ≤ g given B. Arguing in 
the opposite direction, if f ≤ g given B; then h₀₀ ≤ h₁₀, and h₀₁ ≤ h₁₁. 
Suppose now, for definiteness, h₁₀ ≤ h₁₁, then none of the four 
possibilities is preferred to h₁₁; this proves the point in question. 

It may fairly be said that the person considers B virtually impossible, 
or that B is null; if and only if, for all f and g, f ≤ g given B. Indeed,
if B is null in this sense, the values acts take on elements of B are irrele- 
vant to all decisions. 
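The definitions of “≤ given B” and of a null event can be tried out on a small finite example. The sketch below is only an illustration under assumptions that are not part of the theory at this stage: three hypothetical states, numeric consequences, and a preference between acts computed as an expected value under a fixed weighting of the states. The names STATES, PROB, value, and the rest are invented for the example.

```python
from fractions import Fraction
from itertools import product

# Assumed toy model (not from the text): three states, numeric consequences,
# and acts compared by expected value under fixed weights on the states.
STATES = ("s1", "s2", "s3")
PROB = {"s1": Fraction(1, 2), "s2": Fraction(1, 2), "s3": Fraction(0)}
CONSEQUENCES = (0, 1, 2)


def value(act):
    """Expected value of an act, an act being a dict from states to consequences."""
    return sum(PROB[s] * act[s] for s in STATES)


def not_preferred(f, g):
    """f <= g in the toy model."""
    return value(f) <= value(g)


def all_acts():
    """Every function from STATES to CONSEQUENCES."""
    for values in product(CONSEQUENCES, repeat=len(STATES)):
        yield dict(zip(STATES, values))


def not_preferred_given(f, g, B):
    """f <= g given B: compare every pair of modifications of f and g that
    agree with f and g on B and with each other (through h) outside B."""
    for h in all_acts():
        f_mod = {s: f[s] if s in B else h[s] for s in STATES}
        g_mod = {s: g[s] if s in B else h[s] for s in STATES}
        if not not_preferred(f_mod, g_mod):
            return False
    return True


def is_null(B):
    """B is null when f <= g given B for all acts f and g."""
    return all(not_preferred_given(f, g, B)
               for f in all_acts() for g in all_acts())


if __name__ == "__main__":
    for event in (set(), {"s3"}, {"s1"}):
        print(sorted(event), is_null(event))
```

In this toy model the null events are exactly those carrying zero weight, which matches the reading of a null event as one the person considers virtually impossible.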

Several trivial conclusions about null events are listed as a compound 


theorem, all components but the last of which have immediate intuitive 
interpretations. 


THEOREM 1 


1. The vacuous event, 0, is null.

2. B is null, if and only if, for every f and g, f ≐ g given B.

3. If B is null, and B ⊃ C; then C is null.

4. If ~B is null; f ≤ g given B, if and only if f ≤ g.

5. f ≤ g given S, if and only if f ≤ g.

6. If S is null, f ≐ g for every f and g.


Component 6 of Theorem 1 requires comment, because it corresponds 
to a pathological situation. In case S is null, it is not really intuitive 
to say that S (and therefore every event) is virtually impossible. The
interpretation is rather that the person simply doesn’t care what hap- 
pens to him. This is imaginable, especially under a suitably restricted 
interpretation of F, but it is uninteresting and will accordingly be ruled 
out by a later postulate, P5. 

A finite set of events B_i is a partition of B; if B_i ∩ B_j = 0, for i ≠ j,
and ∪_i B_i = B. With this definition, it is easily proved by arithmetic
induction that

THEOREM 2 If B_i is a partition of B, and f ≤ g given B_i for each i,
then f ≤ g given B. If, in addition, f < g given B_j for at least one j,
then f < g given B.


COROLLARY 1 The union of any finite number of null events is null. 


There are still other interesting consequences of Theorem 2, which
may be most conveniently mentioned informally. If, in Theorem 2,
B = S (or, more generally, if ~B is null), it is superfluous to say “given
B” in the conclusions of the theorem. If f ≐ g given B_i for each i,
then f ≐ g given B. So much for the consequences of P2.

Acts that are constant, that is, acts whose consequences are inde- 
pendent of the state of the world, are of special interest. In particular, 
they lead to a natural definition of preference among consequences in 
terms of preference among acts. Following ordinary mathematical usage,
f ≡ g will mean that f is identically g, that is, for every s, f(s) = g.
A formal definition of preference among consequences can now
conveniently be expressed thus. For any consequences g and g', g ≤ g';
if and only if, when f ≡ g and f' ≡ g', f ≤ f'.

In the same spirit, meaning can be assigned to such expressions as 
f < g, g < f given B, etc., and I will freely use such expressions without 
defining them explicitly. In particular, f < g given B has a natural 
meaning, but one that is rendered superfluous by the next postulate, 
P3. 

Incidentally, it is now evident how awkward for us it would be to 
use f(s) for f; because f(s) ≤ g(s) is a statement about the consequences
f(s) and g(s), whereas f ≤ g is a statement about acts, and we will
have frequent need for both sorts of statements. 


Suppose that f ≡ g, and f' ≡ g', and that g ≤ g'; is it reasonable to
admit that, for some B, f > f' given B? That depends largely on the
interpretation we choose to make of our technical terms, as an example
helps to bring out.
Before going on a picnic with friends, a person decides to buy a
bathing suit or a tennis racket, not having at the moment enough money
for both. If we call possession of the tennis racket and possession of
the bathing suit consequences, then we must say that the consequences
of his decision will be independent of where the picnic is actually held.
If the person prefers the bathing suit, this decision would presumably
be reversed, if he learned that the picnic were not going to be held
near water. Thus the question whether it can happen that f > f'
given B would be answered in the affirmative. But, under the
interpretation of “act” and “consequence” I am trying to formulate, this is
not the correct analysis of the situation. The possession of the tennis
racket and the possession of the bathing suit are to be regarded as acts,
not consequences. (It would be equivalent and more in accordance
with ordinary discourse to say that the coming into possession, or the
buying, of them are acts.) The consequences relevant to the decision
are such as these: a refreshing swim with friends, sitting on a shadeless
beach twiddling a brand-new tennis racket while one's friends swim,
etc. It seems clear that, if this analysis is carried to its limit, the
question at issue must be answered in the negative; and I therefore propose
to assume the negative answer as a postulate. The postulate is so
couched as not only to assert that knowledge of an event cannot estab- 
lish a new preference among consequences or reverse an old one, but 
also to assert that, if the event is not null, no preference among conse- 
quences can be reduced to indifference by knowledge of an event. 


P3 If f ≡ g, f' ≡ g', and B is not null; then f ≤ f' given B, if and
only if g ≤ g'.


Applying Theorem 2, it is obvious that 


THEOREM 3 If B_i is a partition of B; and if (for all i and s) f_i ≤ g_i,
f(s) = f_i, and g(s) = g_i when s ∈ B_i; then f ≤ g given B. If, in
addition, f_j < g_j for some j for which B_j is not null, then f < g given B.


Theorem 3 is logically equivalent to P3 in the presence of P1 and P2, 
and Theorem 3 can as easily be given an intuitive basis as the postulate 
P3. Therefore the assumption of P3 as a postulate instead of Theorem 
3 is only a matter of taste. 
pted by the British-American School 
s having been given to it, in connection 
, by the late Abraham Wald. I believe, 
as will be more fully explained later, that much of its particular sig- 

from the implication that, if several 
preferences among consequences, then 
eferences among certain acts. 


CHAPTER 3 


Personal Probability 


1 Introduction 


I personally consider it more probable that a Republican president 
will be elected in 1996 than that it will snow in Chicago sometime in the 
month of May, 1994. But even this late spring snow seems to me more 
probable than that Adolf Hitler is still alive. Many, after careful con- 
sideration, are convinced that such statements about probability to a 
person mean precisely nothing, or at any rate that they mean nothing 
precisely. At the opposite extreme, others hold the meaning to be so 
self-evident as to be unanalyzable. An intermediate position is taken 
in this chapter, where a particular interpretation of probability to a 
person is given in terms of the theory of consistent decision in the face
of uncertainty, the exposition of which was begun in the last chapter. 
Much as I hope that the notion of probability defined here is consistent 
with ordinary usage, it should be judged by the contribution it makes 
to the theory of decision, not by the accuracy with which it analyzes 
ordinary usage. 

Perhaps the first way that suggests itself to find out which of two 
events a person considers more probable is simply to ask him. It might 
even be argued, though I think fallaciously, that, since the question 
concerns what is inside the person’s head, there can be no other method, 
just as we have little, if any, access to a person’s dreams except through 
his verbal report. Attempts to define the relative probability of a pair
of events in terms of the answers people give to direct interrogation
have justifiably met with antipathy from most statistical theorists. In
the first place, many doubt that the concept “more probable to me
than” is an intuitive one, open to no ambiguity and yet admitting no
further analysis. Even if the concept were so completely intuitive,
which might justify direct interrogation as a subject worthy of some
psychological study, what could such interrogation have to do with the
behavior of a person in the face of uncertainty, except of course for his
verbal behavior under interrogation? If the state of mind in question
is not capable of manifesting itself in some sort of extraverbal behavior,
it is extraneous to our main interest. If, on the other hand, it does
manifest itself through more material behavior, that should, at least 
in principle, imply the possibility of testing whether a person holds 
one event to be more probable than another, by some behavior express- 
ing, and giving meaning to, his judgment. It would, in short, be pref- 
erable, at least in principle, to interrogate the person, not literally 
through his verbal answer to verbal questions, but rather in a figurative 
sense somewhat reminiscent of that in which a scientific experiment is 
sometimes spoken of as an interrogation of nature. Several schemes of 
behavioral, as opposed to direct, interrogation have been proposed. 
The one introduced below was suggested to me by a passage of de
Finetti's (on pp. 5-6 of [D2]), though the passage itself does not
emphasize behavioral interrogation.


To illustrate the scheme, our idealized person has just taken two
eggs from his icebox and holds them unbroken in his hand. We wonder
whether he thinks it more probable that the brown one is good than
that the white one is. Our curiosity being real, we are prepared to
pay, if necessary, to have it satisfied. We therefore address him thus:
“We see that you are about to open those eggs. If you will be so
cooperative as to guess that one or the other egg is good, we will pay you
... If incorrect, you and we ... exchange your two eggs for ...”
If under these circumstances the person ... on the brown egg, it seems
to me to correspond well with ordinary usage to say that it is more
probable to him that the brown one is good than that the white one is.
Though, of course, I hope for your agreement on this analysis of ordinary
usage, I repeat that it is not really fundamental to the subsequent
argument, as indeed no such lexicographical point could be; for the utility
of a construct or definition depends only secondarily on the aptness of the
expression in terms of which it is couched.

There is a mode of interrogation intermediate between what I have
called the behavioral and the direct. One can, namely, ask the person,
... what he would do in such and such a situation. In so far as the
theory of decision under development is regarded as an empirical one,
the intermediate mode is a compromise between economy and rigor.
But, in the theory's more important normative interpretation as a set
of criteria of consistency for us to apply to our own decisions, the
intermediate mode seems to me to be just the right one.



Though it entails digression from the main theme, some readers may 
be interested in a few words about actual experimentation on strictly 
empirical behavioral interrogation. Some key references bearing on 
the subject are [M4], [R3], and [W8]. 

In the first place, a little reflection shows that an experiment in which 
human subjects are required to decide among actual acts may be very 
expensive in time, money, and effort, especially if the consequences en- 
visaged are expensive to provide, a point discussed in detail in [W8]. 
Questions of morality, and even of legality, toward the subject may 
further complicate the investigation. For example, Mosteller and No- 
gee, as described in Section 3B of [M4], made certain that every sub- 
ject in one experiment of theirs would be financially benefited, though 
they kept this security secret from the subjects. 

There is also a difficulty in principle. Suppose that I wish to dis- 
cover a person’s preferences among several acts—three acts f, g, and h 
are sufficient to bring out the difficulty. If I in good faith offer him the 
opportunity to decide among all three, and he decides on f; then there 
is no further possibility of discovering what his preference was between 
g and h. Suppose, for example, that a hot man actually prefers a swim,
a shower, and a glass of beer, in that order. Once he decides on, and 
thereby becomes entitled to, the swim, he can no longer appropriately 
be asked to decide between shower and beer. A naive attempt to do so 
would result in his deciding between a swim and shower on the one 
hand, and a swim and beer on the other—an altogether different
situation from the one intended.

The difficulty can sometimes be met by special devices. For example, 
the investigator might wait for a different but “similar” occasion. But
W. Allen Wallis has mentioned to me an interesting and very general
device, which will now be described, with his permission.†

Suppose that the hot man is instructed to rank the three acts in 
order, subject to the consideration that two of them will be drawn at 
random (e.g., by card drawing or dice rolling), and that he is then to 
have whichever of these two acts he has assigned the lower rank. He 
is thus called on to select one of six acts, that is, one of the six possible 
rankings. If he does, for example, select the ranking {swim, shower,
beer}, it follows easily from the theory of decision thus far developed
that for him swim ≥ shower ≥ beer, barring the farfetched possibility
that he regards one or more of the three drawings as virtually impossible
and provided that his preference among the three acts swim, shower,
beer given any of the three drawings is the same as his original
preference. The investigator could in practice design the drawing in such a
way as to be well satisfied that the required “irrelevance” obtained,
except for very “superstitious” people. This ends the present digression on
actual behavioral interrogation.


† I have since seen this same device used by M. Allais.
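The ranking device attributed above to Wallis can be illustrated with a small enumeration. The sketch is hedged: it assumes, purely for illustration, that the person's true preferences can be summarized by made-up numeric worths (TRUE_WORTH), and that “lower rank” means listed earlier in the submitted ranking. It then lists what each of the six possible rankings yields under each of the three possible drawings.

```python
from itertools import combinations, permutations

ACTS = ("swim", "shower", "beer")

# Assumed for illustration only: a larger number means more preferred.
TRUE_WORTH = {"swim": 3, "shower": 2, "beer": 1}


def outcome(ranking, pair):
    """The act received: whichever member of the drawn pair is listed earlier."""
    return min(pair, key=ranking.index)


def payoffs(ranking):
    """True worth of the act received under each of the three possible drawings."""
    return [TRUE_WORTH[outcome(ranking, pair)]
            for pair in combinations(ACTS, 2)]


if __name__ == "__main__":
    for ranking in permutations(ACTS):
        print(ranking, payoffs(ranking))
```

Under these assumptions the ranking that reports the true order does at least as well as any other ranking in every drawing, and strictly better in at least one, which is why the submitted ranking can be read as revealing the preference order.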

The purpose of this chapter is to explore the concept of personal 
probability † that was indicated in the example about the two eggs.
The concept will be put on a formal basis in § 2 by introducing two new 
postulates, P4 and P5, to be used in conjunction with P1-3. This will 
lead to a formal analysis of the notion that one event is no more prob- 
able than another. Several deductions about this notion reminiscent 
of mathematical properties ordinarily attributed to probability will be 
made; but only in § 3, after adjunction of still another postulate, P6, 
can the notion be connected quantitatively with what mathematicians 
ordinarily call mathematical probability. Section 4 is devoted to some
mathematically technical criticisms of the notion of personal
probability, which can safely be skipped or skimmed by those not interested
in such matters. Section 5 discusses conditional personal probability;
§ 6, the approach to certainty through a long sequence of conditionally
independent relevant observations; and § 7, an extension of the concept
of a sequence of independent events, particularly interesting from the 
viewpoint of personal probability. 


2 Qualitative personal probability 


When I spoke in the introductory section of offering the person a
dollar if his guess about the egg proved correct, it was tacitly assumed
that his guess would not be affected by the amount of the prize offered.
That seems to me correct in principle. It would, for example, seem
unreasonable for the person with the two eggs to reverse his decision if
the prize were reduced from ... possibility carefully, I suspect ... so
formally leads to fruitless and endless regression.


† The term “personal probability” was suggested to me orally by Thornton C.
Fry. Some other terms suggested for the same concept are “subjective
probability,” “psychological probability,” and “degree of conviction.”
To offer a prize in case A obtains means to make available to the
person an act f_A such that


(1)   f_A(s) = f   for s ∈ A,
      f_A(s) = f'  for s ∈ ~A,


where f' < f. The assumption that on which of two events the person
will choose to stake a given prize does not depend on the prize itself
is expressed by the following postulate, which looks formidable only
because it contains four definitions like (1). The reader may find it
helpful to graph an instance of the postulate in the spirit of Figure 1.


P4 If f, f', g, g'; A, B; f_A, f_B, g_A, g_B are such that:


1. f' < f, g' < g;

2a. f_A(s) = f, g_A(s) = g for s ∈ A,
    f_A(s) = f', g_A(s) = g' for s ∈ ~A;

2b. f_B(s) = f, g_B(s) = g for s ∈ B,
    f_B(s) = f', g_B(s) = g' for s ∈ ~B;

3. f_A ≤ f_B;


then g_A ≤ g_B.


In the light of P4, it will be said that A is not more probable than
B, abbreviated A ≤ B; if and only if when f' < f and f_A, f_B are such
that

f_A(s) = f for s ∈ A, f_A(s) = f' for s ∈ ~A,

f_B(s) = f for s ∈ B, f_B(s) = f' for s ∈ ~B;

then f_A ≤ f_B.

The assumption that there is at least one worth-while prize is
innocuous; for, though a context failing to satisfy it might arise, such a
context would be too trivial to merit study. I therefore propose the
following postulate.

P5 There is at least one pair of consequences f, f’ such that f’ < f. 
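What P4 demands can be seen in a toy model. The sketch below assumes, for illustration only, numeric consequences and a person who ranks acts by expected value under made-up weights on three hypothetical states; none of these names or numbers come from the text. In such a model the event on which the person prefers to stake a prize does not depend on the prize, so long as the prize is worth more than the consolation.

```python
from fractions import Fraction

# Assumed weights on three hypothetical states (illustration only).
PROB = {"s1": Fraction(1, 6), "s2": Fraction(1, 3), "s3": Fraction(1, 2)}


def stake(prize, consolation, event):
    """Act paying `prize` when a state of `event` obtains, else `consolation`."""
    return {s: prize if s in event else consolation for s in PROB}


def value(act):
    """Expected value of an act under the assumed weights."""
    return sum(PROB[s] * act[s] for s in PROB)


def not_more_probable(A, B, prize, consolation):
    """A <= B as revealed by the choice of where to stake the given prize."""
    return value(stake(prize, consolation, A)) <= value(stake(prize, consolation, B))


if __name__ == "__main__":
    A, B = {"s1"}, {"s2"}
    for prize, consolation in [(1, 0), (100, -5), (Fraction(1, 100), 0)]:
        print(prize, consolation,
              not_more_probable(A, B, prize, consolation),
              not_more_probable(B, A, prize, consolation))
```

The three prize schedules give the same verdict on A and B, which is the content of P4 in this special case; the postulate itself makes no reference to numbers or expected values.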

All the implications to be deduced from P1-5 for some time to come 
are themselves implications of the three easily established conclusions, 
which are introduced by the following definition and theorem.

A relation ≤· between events is a qualitative probability; if and only
if, for all events B, C, D,


1. ≤· is a simple ordering,

2. B ≤· C, if and only if B ∪ D ≤· C ∪ D, provided B ∩ D =
C ∩ D = 0,

3. 0 ≤· B, 0 <· S.


It may be helpful to remark that the second part of the above
definition says, in effect, that it will not affect the person's guess to offer
him a consolation prize in case neither B nor C obtains, but D happens
to.
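The three conditions defining a qualitative probability can be checked mechanically on a finite set. The following sketch is an illustration only: it assumes a three-element set with made-up weights, takes the relation induced by comparing total weights, and tests the conditions by brute force over all events. The function names are invented for the example.

```python
from fractions import Fraction
from itertools import combinations

S = frozenset({"a", "b", "c"})
# Assumed weights, chosen only for illustration.
WEIGHT = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}


def events(universe):
    """All subsets of the universe."""
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def P(B):
    """Total weight of an event."""
    return sum((WEIGHT[s] for s in B), Fraction(0))


def leq(B, C):
    """The relation induced by the weights: B is not more probable than C."""
    return P(B) <= P(C)


def is_qualitative_probability(rel, universe):
    """Brute-force check of the three defining conditions."""
    ev = events(universe)
    empty = frozenset()
    complete = all(rel(B, C) or rel(C, B) for B in ev for C in ev)
    transitive = all(rel(B, D) for B in ev for C in ev for D in ev
                     if rel(B, C) and rel(C, D))
    additive = all(rel(B, C) == rel(B | D, C | D)
                   for B in ev for C in ev for D in ev
                   if not (B & D) and not (C & D))
    bounds = all(rel(empty, B) for B in ev) and not rel(universe, empty)
    return complete and transitive and additive and bounds


if __name__ == "__main__":
    print(is_qualitative_probability(leq, S))
```

A relation that ranks every event as no more probable than every other fails the third condition, since it would make S no more probable than the vacuous event.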


THEOREM 1 The relation ≤ as applied to events is a qualitative
probability.


You will have no difficulty in proving that Theorem 1 follows from
P1-5. Theorem 1 has many consequences of the sort one would expect
if ≤ meant “not more probable than” in any sense having the
mathematical properties ordinarily attributed to numerical probability. This
is illustrated by the following list of exercises, which should not only
be proved formally, but also interpreted intuitively. One easy exercise
not included in the list below, because it is not strictly a consequence
of Theorem 1 alone, is to show that B ≐ 0, if and only if B is a null
event.


Exercises 


1. If B ⊂ C, then 0 ≤ B ≤ C ≤ S.

2a. If B ∩ D = C ∩ D = 0; then B < C, if and only if B ∪ D <
C ∪ D.

2b. If 0 < C, and B ∩ C = 0; then B < B ∪ C.

3. If B ≤ C, then ~C ≤ ~B; and conversely. Hint: Draw a Venn
diagram of the fourfold partition B ∩ C, ~B ∩ C, B ∩ ~C, ~B ∩ ~C.

4a. If B ≤ C, and C ∩ D = 0; then B ∪ D ≤ C ∪ D.

4b. If B ≤ 0; then B ∪ C ≐ C, and B ≐ 0.

4c. If S ≤ B; then B ∩ C ≐ C, and B ≐ S.

4d. If B ∪ D ≤ C ∪ D, and B ∩ D = 0; then B ≤ C.

5a. If B_1 ≤ C_1, B_2 ≤ C_2, and C_1 ∩ C_2 = 0; then B_1 ∪ B_2 ≤ C_1 ∪
C_2. Hint: Exhibit B_2 and C_1 in the form B_2 = B_2' ∪ Q, C_1 = C_1' ∪ Q
with B_2', C_1', Q disjoint. Justify the following calculation, step by step.

B_1 ∪ B_2' ≤ C_1 ∪ B_2' = C_1' ∪ B_2 ≤ C_1' ∪ C_2,

whence B_1 ∪ B_2 ≤ C_1 ∪ C_2.

5b. If B_1 ∪ B_2 ≤ C_1 ∪ C_2 and B_1 ∩ B_2 = 0; then B_1 ≤ C_1 or
B_2 ≤ C_2.

6. If B ≤ ~B and C ≥ ~C, then B ≤ C; equality holding in the
conclusion, if and only if it holds in both parts of the hypothesis.


3 Quantitative personal probability 

As I have said, the exercises terminating the preceding section sug- 
gest a close mathematical parallelism between personal probability and 
the mathematical properties ordinarily attributed to probability, though 
the postulates assumed thus far do not (as could easily be demonstrated) 
make it possible to deduce from this parallelism the unambiguous as- 
signment of a numerical probability to each event. But, if, for example 
(following de Finetti [D2]), a new postulate asserting that S can be 
partitioned into an arbitrarily large number of equivalent subsets were 
assumed, it is pretty clear (and de Finetti explicitly shows in [D2]) 
that numerical probabilities could be so assigned. It might fairly be 
objected that such a postulate would be flagrantly ad hoc. On the 
other hand, such a postulate could be made relatively acceptable by 
observing that it will obtain if, for example, in all the world there is a 
coin that the person is firmly convinced is fair, that is, a coin such that 
any finite sequence of heads and tails is for him no more probable than 
any other sequence of the same length; though such a coin is, to be sure, 
a considerable idealization. 

After some general and abstract discussion of the mathematical
connection between qualitative and quantitative probability, a postulate,
P6, will be proposed, which, though logically actually stronger than the
assumption that there are partitions of S into equivalent events, seems 
to me even easier to accept. Once P6 is accepted, there will scarcely 
again be any need to refer directly to qualitative probability. 

To begin with, let me say precisely what is meant, in the present
context, by a probability measure, this being the standard term for
what I would here otherwise prefer to call a quantitative probability,
and what it means for a probability measure to be in agreement with
a qualitative probability.

A probability measure on a set S is a function P(B) attaching to
each B ⊂ S a real number such that:


1. P(B) ≥ 0 for every B.
2. If B ∩ C = 0, P(B ∪ C) = P(B) + P(C).
3. P(S) = 1.

This definition, or something very like it, is at the root of all ordinary
mathematical work in probability.

If S carries a probability measure P and a qualitative probability
≤· such that, for every B, C, P(B) ≤ P(C), if and only if B ≤· C;
then P (strictly) agrees with ≤·. If B ≤· C implies P(B) ≤ P(C),
then P almost agrees with ≤·. This terminology is obviously
consistent in that, if P agrees, that is, strictly agrees, with ≤·, P also
almost agrees with ≤·. It is also easily seen that, if P agrees with ≤·,
then knowledge of P implies knowledge of ≤·. But, if P only almost
agrees with ≤·, it may happen, as examples in § 4 show, that P(B) =
P(C), though B <· C, so that knowledge of P may imply only imperfect
knowledge of ≤·.
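The difference between strict and almost agreement can be made concrete on a finite set. The example below is not one of the examples of § 4; it is a hypothetical construction for illustration, in which events are compared first by one made-up weighting P1 and, in case of a tie, by a second weighting P2. The resulting lexicographic relation satisfies the three conditions of a qualitative probability, P1 almost agrees with it, and P1 does not strictly agree with it.

```python
from fractions import Fraction
from itertools import combinations

STATES = ("a", "b", "c")
# Two assumed weightings (illustration only): P1 is compared first, P2 breaks ties.
P1 = {"a": Fraction(1, 2), "b": Fraction(1, 2), "c": Fraction(0)}
P2 = {"a": Fraction(1, 3), "b": Fraction(1, 3), "c": Fraction(1, 3)}

EVENTS = [frozenset(c) for r in range(len(STATES) + 1)
          for c in combinations(STATES, r)]


def measure(weights, B):
    """Total weight of the event B."""
    return sum((weights[s] for s in B), Fraction(0))


def leq(B, C):
    """Lexicographic relation: compare by P1, then by P2."""
    return (measure(P1, B), measure(P2, B)) <= (measure(P1, C), measure(P2, C))


def strictly_agrees(weights):
    """P(B) <= P(C) if and only if B is not more probable than C."""
    return all((measure(weights, B) <= measure(weights, C)) == leq(B, C)
               for B in EVENTS for C in EVENTS)


def almost_agrees(weights):
    """B not more probable than C implies P(B) <= P(C)."""
    return all(measure(weights, B) <= measure(weights, C)
               for B in EVENTS for C in EVENTS if leq(B, C))


if __name__ == "__main__":
    print(almost_agrees(P1), strictly_agrees(P1))
```

Here the events {a} and {a, c} receive the same value under P1, yet {a} is strictly less probable than {a, c} in the lexicographic relation, so knowing P1 alone gives only imperfect knowledge of the relation.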

The rest of this section is mainly a study of qualitative probabilities 
generally, with a view to discovering interesting conditions under which 
there is a probability measure that agrees, either strictly or almost, 
with a given qualitative probability. These conditions suggest a new 
postulate governing the special qualitative probability <. The work 
is necessarily rather tedious and burdened with detail. It will, there- 
fore, be wise for most readers to skim over the material, omitting the 
proofs but noticing the more obvious logical connections among the 
theorems and definitions. Some may then find themselves sufficiently 
interested in the details to return and read or supply the proofs, as the 
case may require. Others may safely go forward. Here, as elsewhere,
technical terms of interest for the moment only are introduced with
italics rather than boldface.

An n-fold almost uniform partition of B is an n-fold partition of B
such that the union of no r elements of the partition is more probable
than that of any r + 1 elements.


THEOREM 1 If there exist n-fold almost uniform partitions of B for 
arbitrarily large values of n, then there exist m-fold almost uniform par- 
titions for every positive integer m. 


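PROOF. Let B_i, i = 1, ..., n, be an n-fold almost uniform partition
(of B) with n ≥ m². Using the euclidean algorithm, let n be written
n = am + b, where a and b are integers such that a ≥ m and 0 ≤ b <
m. Now let C_j, j = 1, ..., m, be a partition of B in which each
C_j is the union of a or a + 1 of the B_i's. The union of any r of the C_j's,
r < m, is the union of from ar to (a + 1)r of the B_i's and the union of
any r + 1 of them is the union of from a(r + 1) to (a + 1)(r + 1) of the
B_i's. But (a + 1)r = ar + r < ar + a = a(r + 1). ∎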
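The counting behind this argument can be sketched in a few lines. The code below is illustrative only and its function names are invented: it splits n blocks into m groups of a or a + 1 blocks each, where n = am + b, and checks the inequality (a + 1)r < a(r + 1) for every r < m, which holds whenever n is at least m².

```python
def group_sizes(n, m):
    """Sizes of m groups covering n blocks, each of size a or a + 1."""
    a, b = divmod(n, m)  # n = a*m + b with 0 <= b < m
    return [a + 1] * b + [a] * (m - b)


def counting_bound_holds(n, m):
    """True when any r groups contain fewer blocks than any r + 1 groups,
    that is, when (a + 1) * r < a * (r + 1) for every r with 1 <= r < m."""
    a, _ = divmod(n, m)
    return all((a + 1) * r < a * (r + 1) for r in range(1, m))


if __name__ == "__main__":
    print(group_sizes(17, 4), counting_bound_holds(17, 4))
    print(group_sizes(5, 4), counting_bound_holds(5, 4))
```

With n = 17 and m = 4 the bound holds, while with n = 5 and m = 4 (where n is less than m²) it fails, which shows why the proof asks for a fine enough starting partition.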
THEOREM 2 If there exist n-fold almost uniform partitions of S for
arbitrarily large values of n, then there is one and only one probability
measure P that almost agrees with ≤·. Furthermore, for any p, 0 ≤ p
≤ 1, any B ⊂ S, and the unique P just defined, there exists C ⊂ B
such that P(C) = pP(B).†

Proor. The proof is broken into a sequence of easy steps, left, for 
the most part, to the reader. These steps are grouped in blocks, only 
the last step in each being needed in the proof of later steps. 

1. There exist n-fold almost uniform partitions of S for every
positive n.

2a. If p_1, ..., p_n are real numbers such that 0 ≤ p_1 ≤ p_2 ≤ ... ≤ p_n,
and Σ p_i = 1; then

(1)   $\sum_{i=1}^{r} p_i \le r/n, \quad r = 1, \ldots, n.$

2b. If further

      $\sum_{i=1}^{r+1} p_i \ge \sum_{i=n-r+1}^{n} p_i$   for r = 1, ..., n − 1,

then

(2)   $\sum_{i=1}^{r} p_i \ge (r - 1)/n$, and $\sum_{i=n-r+1}^{n} p_i \le (r + 1)/n.$

2c. The sum of any r of the p_i's lies between (r − 1)/n and (r + 1)/n.

2d. If P almost agrees with ≤·, and C(r, n) denotes here and later
in this proof any union of r elements of any n-fold almost uniform
partition (not necessarily the same from one context to another), then

(3)   (r − 1)/n ≤ P(C(r, n)) ≤ (r + 1)/n.

3. k(B, n) denotes the largest integer r (possibly zero) such that
some C(r, n) is not more probable than B. The function k(B, n) is
well-defined, and 0 ≤ k(B, n) ≤ n.

4a. For any P that almost agrees with ≤·,

(4)   (k(B, n) − 1)/n ≤ P(B) ≤ (k(B, n) + 2)/n.

4b. At most one P can almost agree with ≤·.

5a. If B_i and C_i are n-fold partitions (not necessarily almost uniform)
so indexed that B_1 ≤· B_2 ≤· ... ≤· B_n and C_1 ≥· C_2 ≥· ... ≥· C_n;
then

(5)   $\bigcup_{i=1}^{r} B_i \mathrel{\le\cdot} \bigcup_{i=1}^{r} C_i, \quad r = 1, \ldots, n.$

† Technical note: The mathematical ... conclusion of this theorem, and
other conclusions related to it, are given by ... ([S15]). It might be
conjectured, in analogy with countably additive ..., that this conclusion
means only that P is non-atomic, but that conjecture ... [N5] ...

5b. If in addition the two partitions are almost uniform, then

(6)   $\bigcup_{i=1}^{r} C_i \mathrel{\le\cdot} \bigcup_{i=1}^{r+2} B_i, \quad r = 1, \ldots, n - 2.$

(Proof. $\bigcup_{i=1}^{r+2} B_i \mathrel{\ge\cdot} \bigcup_{i=n-r}^{n} B_i \mathrel{\ge\cdot} \bigcup_{i=n-r}^{n} C_i \mathrel{\ge\cdot} \bigcup_{i=1}^{r} C_i.$)

5c. The union of any r elements of one almost uniform partition is
not more probable than the union of any r + 2 elements of another.

5d. If B ∩ C = 0, then

(7)   k(B, n) + k(C, n) − 2 ≤ k(B ∪ C, n) ≤ k(B, n) + k(C, n) + 1.
6a. If a C(r, m) is not more probable than a C(s, n), then

      $\frac{r - 2}{m} \le \frac{s + 2}{n} + \frac{1}{mn}.$

(Consider an mn-fold almost uniform partition, and use the easily
established fact that the union of any t + 2 elements of an almost
uniform partition is actually more probable than that of any t elements.)
k(B,m) — k(B, n) 


m n 


6b. 


soe dÍ 
<> +4 


Tm n m 


6c. It is meaningful to define P(B) by

(9)   $P(B) = \lim_{n \to \infty} k(B, n)/n,$

that is, the limit exists.

7. P(B), as just defined, is a probability measure, and the only one
that almost agrees with ≤·.

8a. There exist two infinite sequences of sets C_n and D_n contained
in B such that: ...

8b. P(∪_n C_n) ≥ pP(B), P(∪_n D_n) ≥ (1 − p)P(B), and (∪_n C_n) ∩
(∪_n D_n) = 0.

8c. P(∪_n C_n) = pP(B). ∎


A few technical terms of localized interest only are now introduced.
If and only if, for every B >· 0, there is a partition of S, no element of
which is more probable than B; ≤· is fine. B and C are almost
equivalent, written B =· C; if and only if for all non-null G and H such that
B ∩ G = C ∩ H = 0, B ∪ G ≥· C and C ∪ H ≥· B. It is obvious
that equivalent events are also almost equivalent. Finally, if and only
if every pair of almost equivalent events are equivalent, ≤· is tight.


THEOREM 3 


Hyp, <. is fine. 
Conct. 1. If B>-0, and C >- 0; there exists D C C such that 
Dg- Dg: B, 

2. If B=G, CH, and BNC=GNH=0; then BUC 
=-GUH. 

3. IfB=-0,@o-H,BUC=-GUH,andBNC=GNH=0; 
then B =. G. , A 

4. Any partition of S into almost equivalent events is an almost
uniform partition.

5. Any event can be partitioned into two almost equivalent events.

6. Any event can be partitioned into 2^n almost equivalent events,
for any non-negative integer n.

7. There exists one and only one P that almost agrees with ≤·.
For any B, p (0 ≤ p ≤ 1), and the unique P just defined, there
exists C ⊂ B such that P(C) = pP(B). If B >· 0, P(B) > 0. Finally,
B =· C, if and only if P(B) = P(C).

PROOF. The parts of the conclusion are so arranged that each is easy
to prove in the light of its predecessors, but proofs for Parts 3 and 5
are given below. It may be remarked that all parts are trivial
consequences of the last one and have therefore relatively little importance in
themselves.
Part 3. Suppose, for example, BUE< G, BAE=0, and 
a 0; and considen’two cerem d without loss of generality 


(a) If BUC <:S, it may be assume efor ; 
that C n w = 0, whence (B UC) UE>-G UH. ae pes C>-H. 

Let É be partitioned into two non-null events By and £3; 7 bie 
it is absurd to suppose that the part of G outside = Cc ie ; 7 r s à 
would imply C >- G >B U E) there is m G an E x i ~ Ae 
= 0 <. Hl <. Ep. Now UB a BaP eet ‘ 
wh Ta hich is absurd. : 

rep T rr i F can (setting aside the easy P E aer 

=A , POTER G . . 

=: 0) be shown successively that: ae P pi (GNC); (CN H) 
where H>.0 and EC CN G; Sa N cs a eontra ian. 
<- (G N B); and H U E <- G, whie 
Part 5. There exists a sequence of threefold partitions of B, say
C_n, D_n, and G_n, such that:


1. C_n ∪ G_n ≥· D_n, and D_n ∪ G_n ≥· C_n,
2. C_{n+1} ⊃ C_n, D_{n+1} ⊃ D_n, and G_{n+1} ⊂ G_n,
3. ~G_{n+1} ∩ G_n ≥· G_{n+1}; whence G_n contains two disjoint events
each at least as probable as G_{n+1}.


For any H >· 0, G_n ≤· H for sufficiently large n, as may be seen by
considering some m-fold partition no element of which is more probable
than H, and letting n be such that 2^(n−1) > m. If G_n were more probable
than H and therefore more probable than each element of the partition,
it would follow that the union of all elements of the partition, namely
S, is less probable than G_1, which would be absurd.


The two events B_1 = ∪_n C_n, B_2 = (∪_n D_n) ∪ (∩_n G_n) partition B
in the required fashion. ∎

COROLLARY 1 If ≤· is both fine and tight; the only probability
measure that almost agrees with ≤· strictly agrees with it, and there
exist partitions of S into arbitrarily many equivalent events.


THEOREM 4 ≤· is both fine and tight, if and only if, for every B <· C,
there exists a partition of S the union of each element of which with B
is less probable than C.


The proof of this theorem is easy. 


In the light of Theorems 3 and 4, I tentatively propose the following
postulate, P6', governing the relation ≤ among events, and thereby
the relation ≤ among acts.


P6' If B < C, there exists a partition of S the union of each
element of which with B is less probable than C.


justify the assumption of P6’, which 
and tight, th 
fer to stake your gain on C, rather than on the union of B and any
particular sequence of n heads and tails. For you to be able to do so, you
need by no means consider every sequence of heads and tails equally
probable.
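A rough numerical reading of this coin argument can be sketched as follows. The sketch rests on simplifying assumptions that the text does not require: that numerical probabilities are already available, and that each particular sequence of n tosses has probability 2^-n for the person. Under those assumptions the union of B with any one sequence has probability at most P(B) + 2^-n, so it is enough to find an n for which this bound falls below P(C). The function name is invented for the example.

```python
from fractions import Fraction


def flips_needed(p_b, p_c):
    """Smallest n with p_b + 2**-n < p_c, assuming p_b < p_c.

    p_b and p_c stand for the (assumed) numerical probabilities of B and C.
    """
    if not p_b < p_c:
        raise ValueError("requires P(B) < P(C)")
    n = 0
    while p_b + Fraction(1, 2 ** n) >= p_c:
        n += 1
    return n


if __name__ == "__main__":
    print(flips_needed(Fraction(1, 4), Fraction(1, 2)))
```

The closer P(B) is to P(C), the longer the sequences must be, but some finite n always suffices when B is strictly less probable than C.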

It would, however, be disingenuous not to mention that some who 
have worked on a closely related concept of probability, notably Keynes 
[K4] and Koopman [K9], [K10], [K11], would object to P6’ precisely 
because it implies that the agreement between numerical probability 
and qualitative probability is strict. Koopman, for example, holds
that, if $A \supset B$ and $A \ne B$, then A is necessarily more probable than
B, though the numerical probability of A may well be the same as that
of B. Thus, if a marksman shoots at a wall, it is logically contradictory
that his bullet should fall nowhere at all, but it is logically consistent
that a prescribed mathematically ideal point on the bullet should strike
a prescribed mathematically ideal line on the wall. Since the event of
the prescribed point hitting a prescribed line is logically possible,
Koopman would insist that the event is more probable than the vacuous
event, namely that the bullet goes nowhere, though the numerical
probability of both events is zero. I do not take direct issue with Koopman,
because he is presumably talking about a somewhat different concept
of probability from the particular relation $\le\cdot$; but I do not think it
appropriate to suppose that the person would distinctly rather stake a
gain on the line than on the null set. The issue is not really either an
empirical or a normative one, because the point and line in question
are mathematical idealizations. If the point and line are replaced by a
dot and a band, respectively, then, of course, no matter how small the
dot and band may be, the probability of the one hitting the other is
greater than that of the vacuous event. But it seems to me entirely
a matter of taste, conditioned by mathematical experience, to decide
what idealization to make if the dot and band are replaced by their
idealized limits. So much for hair splitting.
As far as the theory of probability per se is concerned, postulate P6′
is all that need be assumed, but in Chapter 5 a slightly stronger
assumption will be needed that bears on acts generally, not only on those
very special acts by which probability is defined. Therefore, I am about to
propose a postulate, P6, that obviously implies P6′ and will therefore
supersede it. This stronger postulate seems to me acceptable for the
same reason that P6′ itself does.


P6 If g < h, and f is any consequence; then there exists a partition
of S such that, if g or h is so modified on any one element of the
partition as to take the value f at every s there, other values being
undisturbed; then the modified g remains less than h, or g remains less
than the modified h, as the case may require.


4 Some mathematical details 


Are there qualitative probabilities that are both fine and tight, that
are fine but not tight, that are tight but not fine, that are neither fine 
nor tight but do have one and only one almost agreeing probability 
measure? Examples answering all these questions in the affirmative 
will be exhibited in this section. 

To indicate a different topic that will also be treated here, those of 
you who have had more than elementary experience with mathematical 
treatments of probability know that it is not usual to suppose, as has 
been done here, that all sets have a numerical probability, but rather 
that a sufficiently rich class of sets do so, the remainder being consid- 
ered unmeasurable. Again, it is usual to suppose that, if each of an 
infinite sequence of disjoint sets is measurable, the probability of their 
union is the sum of their probabilities, that is, probability measures 
are generally assumed to be countably additive. But the theory being 
developed here does assume that probability is defined for all events, 
that is, for all sets of states, and it does not imply countable additivity, 
but only finite additivity. The present section not only answers the 
questions raised in the preceding paragraph, but also discusses the re- 
lation of the notions of limited domain of definition and of countable 


additivity to the theory of probability developed here. The general 
conclusions of this discussion are: First, there is no technical obstacle 
to working with a limited domain of definition, and, except for exposi- 
tory complications, it might have been mildly preferable to have done 
so throughout. Second, it is a little better not to assume countable 
additivity as a postulate, but rather as a special hypothesis in certain 
contexts. A different and much more extensive treatment of these
questions has been given by de Finetti [D4].

Finally, before entering upon the main technical work of this sec- 
tion, one easy question about the relation between qualitative and 
quantitative probability will be answered and several as yet unanswered 
ones will be raised.

Are there qualitative […] agreeing measure? Yes, […] that is fine but
not tight is easi[…] It is, however, an open […] whether a qualitative
probability […] agreeing measure. It would also be technically
interesting to k[…]

Are there qualitative probabilities without any almost agreeing
measure? I do not know.

The matters to be treated in the rest of this section are rather tech- 
nical mathematically, and, though I would not delete them altogether, 
it does not seem justifiable to lay the necessary groundwork for pre- 
senting them in an elementary fashion. Some may, therefore, find it 
necessary to skip the rest of this section altogether, or to skim it rather 
lightly.

It is well known that there does not exist a countably additive
probability measure defined for every subset of the unit interval, agreeing
with Lebesgue measure on those sets where Lebesgue measure is
defined, and assigning the same measure to each pair of congruent sets
(Theorem 41, p. 276 of [H2]). On the other hand, there do exist finitely
additive probability measures agreeing with Lebesgue measure on those
sets for which Lebesgue measure is defined, and assigning the same
measure to each of any pairs of congruent sets; cf. p. 32 of [B4]. The
existence of such measures shows, among other things, that a finitely
additive measure need not be countably additive. Again, calling such
a finitely additive extension of Lebesgue measure P and defining $B \le\cdot\, C$
to mean $P(B) \le P(C)$, we see an example of a qualitative probability
that is both fine and tight.

An example of a qualitative probability that is tight but not fine may
be constructed by taking for S two unit intervals, $S_1$ and $S_2$, in each
of which finitely additive extensions of Lebesgue measure, $P_1$ and $P_2$,
are defined. The generic set B in this example is therefore partitioned
into $B_1 = B \cap S_1$ and $B_2 = B \cap S_2$, respectively. For this example,
let $B \le\cdot\, C$; if, and only if $P_1(B_1) < P_1(C_1)$, or else $P_1(B_1) = P_1(C_1)$
and $P_2(B_2) \le P_2(C_2)$. This $\le\cdot$ is not fine, because, for example, S
cannot be partitioned into events none of which is more probable than
$S_2$. On the other hand, it is easily seen to be tight.

Next, take S to be the union of $S_1$ and $S_2$, with $P_1$ and $P_2$ as defined
in the preceding example, but modify the definition of $\le\cdot$, saying
$B \le\cdot\, C$; if and only if $P_1(B_1) + P_2(B_2) < P_1(C_1) + P_2(C_2)$, or else
$P_1(B_1) + P_2(B_2) = P_1(C_1) + P_2(C_2)$, and $P_1(B_1) \le P_1(C_1)$. This is
an example of a qualitative probability that is fine but not tight.

Combining the ideas of the two preceding examples, one can exhibit
a qualitative probability that is neither fine nor tight, but such
that S can be divided into arbitrarily many equally probable events.
Thus all the questions raised in the opening paragraph of this section
are answered in the affirmative.
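The two lexicographic orderings just described can be tried out on a small scale. The following is only a sketch under stated assumptions: each event is summarized by the hypothetical pair (P1(B1), P2(B2)) rather than by the set itself, the function names are invented for illustration, and the placement of strict and weak inequalities follows one reading of the definitions above.

```python
from fractions import Fraction

# An event B is summarized by the pair (P1(B1), P2(B2)): the measures of its
# parts in the two unit intervals S1 and S2. This summary is an assumption of
# the sketch; the text works with the sets themselves.

def leq_tight_not_fine(b, c):
    """B <=. C iff P1(B1) < P1(C1), or P1(B1) == P1(C1) and P2(B2) <= P2(C2)."""
    return b[0] < c[0] or (b[0] == c[0] and b[1] <= c[1])

def leq_fine_not_tight(b, c):
    """B <=. C iff the totals P1 + P2 are ordered, ties being broken by P1."""
    tb, tc = b[0] + b[1], c[0] + c[1]
    return tb < tc or (tb == tc and b[0] <= c[0])

S2 = (Fraction(0), Fraction(1))               # all of the second interval
sliver = (Fraction(1, 10**6), Fraction(0))    # a tiny piece of the first interval

# First ordering: any piece of S1 with positive P1 is more probable than S2.
sliver_beats_S2 = not leq_tight_not_fine(sliver, S2)

# Second ordering: two events with equal totals are still strictly ranked.
half_in_S1 = (Fraction(1, 2), Fraction(0))
half_in_S2 = (Fraction(0), Fraction(1, 2))
strict = (leq_fine_not_tight(half_in_S2, half_in_S1)
          and not leq_fine_not_tight(half_in_S1, half_in_S2))

print(sliver_beats_S2, strict)
```

In the first ordering every finite partition of S must contain an element with positive P1, and that element is more probable than S2, which is why the ordering fails to be fine. In the second, the total P1 + P2 orders events only up to ties that the ordering itself breaks, so a measure built from the total can agree with it almost but not strictly.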
To get a feeling for the question whether literally all sets should be
regarded as measurable, suppose that S is a cube of unit volume and
that the probability measure P that strictly agrees with $\le\cdot$ is such that
the probability of a parallelepiped is equal to its volume. It follows
that the probability of any set having Jordan content is its Jordan
content, but, if a set has not Jordan content, a continuum of possibili- 
ties is still open. Though other possibilities are conceivable, it is not 
unnatural to consider an idealized person for whom the numerical prob- 
ability attached to each Borel set, or even each Lebesgue measurable 
set, is its Lebesgue measure. To go further and take seriously compari- 
sons between sets that are not Lebesgue measurable, or even between 
those that are not Borel measurable, seems to me to be without any 
implication bearing on reality. I suppose it might be argued, on the 
contrary, that there is no feature of reality that can properly be inter- 
preted by postulating that the person is able to compare only sets from 
a sufficiently narrow field, so that it is simpler and more elegant to ad- 
mit all sets. The question seems to be one of taste, but the following 
remark illustrates what I consider an awkwardness in supposing proba- 
bility to be attached to all sets. It would seem, at first glance, that the 
person should be able, if he is so constituted, to regard all pairs of geo- 
metrically congruent sets for which he makes any comparison at all as 
equivalent, but the famous Banach-Tarski paradox [B5] shows that 
this cannot be done if all sets are regarded as measurable. I think it a
little more graceful to abstain from comparison between the more
bizarre sets than to give up, or even much modify, my everyday notions
about the symmetry of such probability problems associated with
geometry.


If one is unwilling to insist on comparison between every pair of
sets, or events; then, in the same spirit, it is inappropriate to insist on
comparison between every pair of acts. All that has been, or is to be,
formally deduced in this book concerning preferences among sets, could
be modified, mutatis mutandis, so that the class of events would not
be the class of all subsets of S, but rather a Borel field, that is, a
σ-algebra, on S; the set of all consequences would be a measurable space,
that is, a set with a particular σ-algebra singled out; and an act would
be a measurable function from the measurable space of events to the
measurable space of consequences. Indeed, the whole thing could be
done for abstract σ-algebras without reference to sets at all, and this
might have some actual advantage, since it would make possible the
identification of events with propositions in almost any formal language,
even one unable to formulate at all the complete descriptions I call
states.

It may seem peculiar to insist on σ-algebras as opposed to finitely
additive algebras even in a context where finitely additive measures are 
the central object, but countable unions do seem to be essential to some 
of the theorems of § 3—for example, the terminal conclusions of Theo- 
rem 3.2 and Part 5 of Theorem 3.3. 

So much of the modern mathematical theory of probability depends 
on the assumption that the probability measures at hand are countably 
additive that one is strongly tempted to assume countable additivity, 
or its logical equivalent, as a postulate to be adjoined to P1-6. But I 
am inclined to agree with de Finetti [D2], [D4] and Koopman [K9], 
[K10], [K11] that, however convenient countable additivity may be, 
it, like any other assumption, ought not be listed among the postulates 
for a concept of personal probability unless we actually feel that its 
violation deserves to be called inconsistent or unreasonable. I know of 
no argument leading to the requirement of countable additivity, and 
many of us have a strong intuitive tendency to regard as natural proba- 
bility problems about the necessarily only finitely additive uniform den- 
sities on the integers, on the line, and on the plane. It therefore seems 
better not to assume countable additivity outright as a postulate, but 
to recognize it as a special hypothesis yielding, where applicable, a large 


class of useful theorems. 


5 Conditional probability, qualitative and quantitative 

Conditional preferences among acts in the light of a given event were
introduced in § 2.7. Since the relation $\le\cdot$ among events has been
defined in terms of the corresponding relation among acts, we may well
expect to attach meaning to statements of the form $B \le\cdot\, C$ given D,
provided that D is not null. The natural way to do so is to take a pair
of acts f and g that test whether $B \le\cdot\, C$ (as prescribed by the definition
of $\le\cdot$ in terms of acts in § 2) and say that $B \le\cdot\, C$ given D, if and only if
$f \le g$ given D. Since there is more than one pair of acts f, g by which
the proposition $B \le\cdot\, C$ can be tested, it is at first sight conceivable that
not all such pairs would be in the same order given D, which would
frustrate the proposed definition of $\le\cdot$ given D. However, it is easily seen
that for any f, g testing $B \le\cdot\, C$, $f \le g$ given D (D not null) is
equivalent to $B \cap D \le\cdot\, C \cap D$. Thus it is seen not only that the proposed
definition is unambiguous, but also that it is expressible in terms of
probability comparisons among sets, without direct reference to acts
at all, and, still further, that the postulates P1-6 apply to the
conditional preference relation $\le$ given D among acts. This preamble
sufficiently motivates the following definition and easy theorem about
qualitative probability relations generally.

If $\le\cdot$ is a qualitative probability, and $0 <\cdot\, D$; then $B \le\cdot\, C$ given
D, if and only if $B \cap D \le\cdot\, C \cap D$.


THEOREM 1 If $\le\cdot$ is a qualitative probability, then so is $\le\cdot$ given
D. If in addition $\le\cdot$ is fine or tight, then $\le\cdot$ given D is correspondingly
fine or tight.


If $\le\cdot$ is fine, then, for any D that is not null, there exists, in view of
Theorem 3.3, one and only one probability measure $P(B \mid D)$, the
(conditional) probability of B given D, that almost agrees with $\le\cdot$ given D.
But, just as one would expect from the traditional study of numerical
probability, and as may be easily verified, $P(B \cap D)/P(D)$ considered
as a function of B for fixed D is a probability measure that almost
agrees with $\le\cdot$ given D. Therefore,


(1) $P(B \mid D) = P(B \cap D)/P(D)$.


As was explained in § 2.7, preference among acts given B can sug- 
gestively be expressed in temporal terms. Analogously, the comparison 
among events given B and, therefore, conditional probability given B 
can be expressed temporally. Thus P(C | B) can be regarded as the 
probability the person would assign to C after he had observed that B 
obtains. It is conditional probability that gives expression in the theory 
of personal probability to the phenomenon of learning by experience. 


In accordance with established usage, a pair of events B, C are called
independent if $P(B \cap C) = P(B)P(C)$. More generally, a set of events
are called independent, if for every finite set of them, say $B_1, \cdots, B_n$;

(2) $P(\bigcap_i B_i) = \prod_i P(B_i)$.

Obviously, if D is not null, B and D are independent; if and only if
$P(B \mid D) = P(B)$, in which case D may fairly be called irrelevant to B.
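The definitions of conditional probability, independence, and irrelevance can be checked by direct enumeration on a small finite world. This is a minimal sketch, assuming a hypothetical world of two fair coin tosses; the names `prob` and `cond` are invented for the illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite world: two tosses of a fair coin, each state equally likely.
states = list(product("HT", repeat=2))
P = {s: Fraction(1, 4) for s in states}

def prob(event):
    """Probability of an event, given as a set of states."""
    return sum(P[s] for s in event)

def cond(b, d):
    """P(B | D) = P(B n D) / P(D); meaningful only when D is not null."""
    return prob(b & d) / prob(d)

B = {s for s in states if s[0] == "H"}    # first toss heads
D = {s for s in states if s[1] == "H"}    # second toss heads
C = {s for s in states if "H" in s}       # at least one head

independent_BD = prob(B & D) == prob(B) * prob(D)
irrelevant_D_to_B = cond(B, D) == prob(B)
independent_BC = prob(B & C) == prob(B) * prob(C)

print(independent_BD, irrelevant_D_to_B, independent_BC)
```

Here B and D are independent and D is irrelevant to B, whereas B and C are not independent, since learning that at least one head occurred changes the probability of a head on the first toss.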

The notions of independence and irrelevance have, so far as I can
see, no analogues in qualitative probability; this is surprising and
unfortunate, for these notions seem to evoke a strong intuitive response.
The absence of these analogues is traceable to the absence of a
qualitative analogue for propositions of the form $P(B \mid C) \le P(G \mid H)$.
Working under a rather different motivation from that which guides this
book, B. O. Koopman [K9], [K10], and [K11] has developed a system of
qualitative possibility in which it is meaningful to compare B given C
with G given H. It is true also that for qualitative probability, even as
defined here, some interconditional comparisons might be natural.
If, for example, $B \le\cdot\, {\sim}B$ given C and ${\sim}G \le\cdot\, G$ given
H, it seems reasonable to establish the convention that B
given C $\le\cdot$ G given H. This sort of extension is not, however, highly
pertinent to my purpose, for here I have little interest in qualitative
probabilities, except as a foundation for quantitative probability.
The following partition formula is well known and easy to prove:

(3) $P(C) = \sum_i P(C \mid B_i) P(B_i)$,

where $B_i$ is a partition of S into non-null sets. If, further, C is not null,
it is also trivial to derive the celebrated Bayes' rule (or theorem),

(4) $P(B_i \mid C) = \frac{P(C \mid B_i) P(B_i)}{P(C)} = \frac{P(C \mid B_i) P(B_i)}{\sum_j P(C \mid B_j) P(B_j)}$.

Illustrations of these formulas are found in all elementary textbooks on
probability, as well as in later sections of this book.

Finally, if neither B nor C is null,

(5) $\frac{P(B \mid C)}{P(B)} = \frac{P(C \mid B)}{P(C)} = \frac{P(B \cap C)}{P(B) P(C)}$,

which may be given the suggestive reading: Knowledge of C modifies
the probability of B by the same factor by which knowledge of B
modifies the probability of C.
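As a numerical illustration of the partition formula, Bayes' rule, and formula (5), here is a short sketch with exact rational arithmetic. The three hypotheses, their prior probabilities, and the likelihoods are hypothetical numbers chosen only for the example.

```python
from fractions import Fraction

# Hypothetical numbers: three hypotheses B_1..B_3 partition S, with prior
# probabilities P(B_i) and likelihoods P(C | B_i) for some event C.
prior = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
likelihood = [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]

# Partition formula (3): P(C) = sum_i P(C | B_i) P(B_i)
p_c = sum(l * p for l, p in zip(likelihood, prior))

# Bayes' rule (4): P(B_i | C) = P(C | B_i) P(B_i) / P(C)
posterior = [l * p / p_c for l, p in zip(likelihood, prior)]

# Formula (5): P(B_i | C) / P(B_i) equals P(C | B_i) / P(C)
factors_b = [post / p for post, p in zip(posterior, prior)]
factors_c = [l / p_c for l in likelihood]

print(p_c, posterior)
```

The two lists `factors_b` and `factors_c` coincide term by term, which is the symmetric reading of (5): C changes the probability of each hypothesis by exactly the factor by which that hypothesis changes the probability of C.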


The concept of random variable enters into almost any discussion of
probability. Experts are fairly well agreed on the following definition.
A random variable is a function x attaching a value x(s) in some set
X to every s in a set S on which a probability measure P is defined.†
Such an S together with the measure P is called a probability space.
Real-valued random variables are the most familiar, though in general
the values X can be things of any sort. If, for example, x and y,
with values in X and Y, respectively, are random variables on the
same measure space, a new random variable z = {x, y} is defined by
setting z(s) = {x(s), y(s)}. The values of z are thus elements of what
is called $X \times Y$ (read the cartesian product of X and Y), the set of
ordered pairs with first element in X and second in Y. The same sort
of thing can be done, of course, for ordered n-tuples and also for infinite
sequences of random variables.

† In many applications of the theory of probability, not all subsets of S or of X
are considered measurable. It is then required as part of the definition of random
variable that x be measurable, i.e., that for every measurable $Y \subset X$, the set of
s's such that $x(s) \in Y$ be measurable.

Two random variables x and y defined on the same measure space S
are called (statistically) independent; if and only if, for every $X_0 \subset X$
and $Y_0 \subset Y$, the two events (i.e., subsets of S) defined by the
conditions $x(s) \in X_0$ and $y(s) \in Y_0$, respectively, are independent.‡ The
extension of this definition from pairs to any number of random variables
is obvious.


6 The approach to certainty through experience 


In § 3, the theory of personal probability was, from the purely math- 
ematical point of view, reduced to that of probability measures, a sub- 
ject that has been elaborately studied, more or less explicitly, for cen- 
turies. Any mathematical problem concerning personal probability is 
necessarily a problem concerning probability measures—the study of 
which is currently called by mathematicians mathematical probability 
—and conversely. The particular outlook and interpretation implicit 
in a personalistic concept of probability leads, however, to problems 
that, though perfectly meaningful for mathematical probability, might 
not otherwise have been emphasized. This section and the succeeding 
one each briefly discuss one such problem. These two problems are 
selected from among many possibilities for the insight they provide 
into the concept of personal probability. 

Before studying these problems, it is necessary to be conversant with 
the material in Appendixes 1 and 2, which is used in the immediate 
sequel and often throughout the rest of this book. 

As was brought out in § 5, the person learns by experience. The
purpose of the present section is to explore with a moderate degree of
generality how he typically becomes almost certain of the truth, when
the amount of his experience increases indefinitely. To be specific,
suppose that the person is about to observe a large number of random
variables, all of which are independent given $B_i$ for each i, where the
$B_i$ are a partition of S. It is to be expected intuitively, and will soon
be shown, that under general conditions the person is very sure that
after making the observation he will attach a probability of nearly 1 to
whichever element of the partition actually obtains.

Formally, let $B_i$ be a partition of S with $P(B_i) = \beta(i)$. Let $x_r$,
$r = 1, 2, \cdots$, be a sequence of random variables, each taking on only a
finite number […] To do so would raise problems of mathe[…]er
interesting, are rather beside the point of this book.

‡ Where not all sets are measurable, $X_0$ and $Y_0$ must, of course, be required to
be measurable.

Let x denote the first n of the random variables $x_r$. It is
to be borne in mind that x depends on n, so, strictly speaking, it should
be written x(n). The assumption that, given $B_i$, the $x_r$'s all have the
same distribution is expressed by


(1) $P(x_r(s) = x_r \mid B_i) = \xi(x_r \mid i)$,

where $\xi(x_r \mid i)$ is defined by the context. Combining (1) with the
assumption that the $x_r$'s are independent given $B_i$,

(2) $P(x \mid B_i) =_{\mathrm{Df}} P(x(s) = \{x_1, \cdots, x_n\} \mid B_i) = \prod_r \xi(x_r \mid i)$,

where a conventional symbol has been used for equal by definition.
These hypotheses having been laid down, it follows from Bayes' rule
and the partition formula, (5.4) and (5.3), that

(3) $P(B_i \mid x) = \frac{P(x \mid B_i) P(B_i)}{P(x)} = \frac{\beta(i) \prod_r \xi(x_r \mid i)}{P(x)}$

and

(4) $P(x) = \sum_i \beta(i) \prod_r \xi(x_r \mid i)$.


In connection with (3), it may be observed in passing that, if the a priori
probability, $\beta(i)$, of $B_i$ is 0, then, no matter what value x is observed,
the a posteriori probability of $B_i$, $P(B_i \mid x)$, is also 0. This is an
example of the general principle that, if some event is regarded as
virtually impossible, then no evidence whatsoever can lend it credibility.
Similarly, (3) implies the equally common-sense principle that, if an
observation x is virtually impossible on the hypothesis (i.e., given)
$B_i$, and x is observed, then $B_i$ becomes virtually impossible a posteriori.

It is particularly interesting to compare the probability of two
elements of the partition, say $B_1$ and $B_2$ for definiteness, in the light of x.

(5) $\frac{P(B_1 \mid x)}{P(B_2 \mid x)} = \frac{\beta(1) \prod_r \xi(x_r \mid 1)}{\beta(2) \prod_r \xi(x_r \mid 2)} = \frac{\beta(1)}{\beta(2)} \prod_r R'(x_r) = \frac{\beta(1)}{\beta(2)} R(x)$,

where self-explanatory abbreviations have been introduced. Equation
(5) is meaningless, if both the numerator and denominator of its
left-hand side vanish. If the denominator alone vanishes, the fraction may
properly be regarded as infinite. This will happen; if and only if $B_2$ is
null, and $B_1$ is not null given x. That is, it will happen if and only if
$\beta(1) \ne 0$, $\beta(2) = 0$, or if $\beta(1) \ne 0$, and $R(x) = \infty$.
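Equation (5) says that the posterior odds are the prior odds multiplied by the likelihood ratio R(x). The following sketch checks this with exact fractions; the priors, the two distributions, and the observed values are hypothetical, and `joint` is a name invented for the illustration.

```python
from fractions import Fraction
from math import prod

# Hypothetical setup: two hypotheses, each observation x_r being 0 or 1.
beta = {1: Fraction(1, 4), 2: Fraction(3, 4)}        # priors beta(i) = P(B_i)
xi = {1: {1: Fraction(2, 3), 0: Fraction(1, 3)},      # xi(x_r | 1)
      2: {1: Fraction(1, 3), 0: Fraction(2, 3)}}      # xi(x_r | 2)

x = [1, 1, 0, 1]                                      # observed values x_1..x_4

def joint(i):
    """beta(i) * prod_r xi(x_r | i), the numerator of equation (3)."""
    return beta[i] * prod(xi[i][v] for v in x)

p_x = joint(1) + joint(2)                             # equation (4)
posterior = {i: joint(i) / p_x for i in (1, 2)}       # equation (3)

R = prod(xi[1][v] / xi[2][v] for v in x)              # likelihood ratio R(x)
posterior_odds = posterior[1] / posterior[2]

print(R, posterior_odds)
```

With these numbers the prior odds are 1/3, the likelihood ratio is 4, and the posterior odds come out as 4/3, in agreement with (5).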

In modern statistical usage, $R'(x_r)$ and $R(x)$ are the likelihood ratios
of $B_1$ to $B_2$ given $x_r$ and x, respectively, quantities of importance in
many theoretical contexts.

If a person contemplates making the observation x, that is, finding
out the value of x(s) for the s that is the true state of the world, it may
properly be asked how probable he considers it that R will turn out to
have a particular value. It will be shown, barring two banal
exceptions, that, for n sufficiently large, the probability, given $B_1$, that $R(x)$ is
greater than any preassigned number is almost 1. The possibility
$P(B_1) = 0$ is to be excepted, for then the conditional probability in
question is meaningless. The other exception occurs when $\xi(x_r \mid 1) =
\xi(x_r \mid 2)$ for every $x_r$, that is, when the common distribution of $x_r$ given
$B_1$ is the same as it is given $B_2$; for then observation of $x_r$ is simply
irrelevant in distinguishing $B_1$ from $B_2$, or, a little more technically, $x_r$
is irrelevant to $B_1$ given $B_1 \cup B_2$, and

(6) $P(R(x) = 1 \mid B_1) = 1$.

Formally, it is to be demonstrated that, unless $P(B_1) = 0$, or (6)
holds,

(7) $\lim_{n \to \infty} P(R(x) > \rho \mid B_1) = 1$ for $0 < \rho < \infty$.


The problem is quite simple when account is taken of the fact that
$R(x)$ is the product of n random variables, $R'(x_r)$, that are independent
given $B_1$. In attacking the problem, two cases are to be distinguished,
according as there are or are not values of $x_r$ that have positive
probability given $B_1$ but zero probability given $B_2$.

It is in practice rather fortunate to find instances of the first case,
for then (7) applies with a vengeance. Indeed, suppose that

(8) $P(R'(x_r) < \infty \mid B_1) = \phi < 1$.

Then

(9) $P(R(x) = \infty \mid B_1) = 1 - \phi^n$,

which obviously approaches 1 with increasing n.
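A small numerical instance of (8) and (9) may help fix ideas. In this sketch the two distributions over three values are hypothetical; the value "c" is possible given the first hypothesis and impossible given the second, so a single occurrence of it makes the likelihood ratio infinite.

```python
from fractions import Fraction

# Hypothetical distributions over three values of x_r. Value "c" is possible
# under B_1 but impossible under B_2, so R'(x_r) is infinite there.
xi1 = {"a": Fraction(1, 2), "b": Fraction(2, 5), "c": Fraction(1, 10)}
xi2 = {"a": Fraction(1, 4), "b": Fraction(3, 4), "c": Fraction(0)}

# phi = P(R'(x_r) < infinity | B_1), as in (8)
phi = sum(p for v, p in xi1.items() if xi2[v] > 0)

def prob_R_infinite(n):
    """Equation (9): P(R(x) = infinity | B_1) = 1 - phi**n."""
    return 1 - phi ** n

values = [prob_R_infinite(n) for n in (1, 10, 50)]
print(phi, [float(v) for v in values])
```

The probability that the decisive value has turned up at least once in n observations rises geometrically toward 1, which is the sense in which (7) holds "with a vengeance" in this case.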
The second case, namely $\phi = 1$, is more interesting. Since much is
known about sums of identically distributed independent random
variables, it is natural to investigate

(10) $\log R(x) = \sum_r \log R'(x_r)$,

thereby replacing a product by a sum. It is easily seen from the
definition of $R'(x_r)$ that $P(R'(x_r) > 0 \mid B_1) = 1$, so, in the case now at
hand, the functions $\log R'(x_r)$ are independent real bounded random
variables.


Letting

(11) $I = E(\log R'(x_r) \mid B_1)$,

the weak law of large numbers† implies that, for any $\epsilon > 0$,

(12) $\lim_{n \to \infty} P(\log R(x) \ge n(I - \epsilon) \mid B_1) = 1$,

equivalently,

(13) $\lim_{n \to \infty} P(R(x) \ge e^{n(I - \epsilon)} \mid B_1) = 1$.

The objective will therefore be achieved, if it is demonstrated that
I > 0 unless (6) holds. But

(14) $I = E(\log R'(x_r) \mid B_1) = -E(\log R'^{-1}(x_r) \mid B_1) \ge -\log E(R'^{-1}(x_r) \mid B_1) = -\log 1 = 0$,

as may be argued thus: The inequality in the above calculation is
assigned as Exercise 8 in Appendix 2, together with the fact that equality
can hold in (14) if and only if $R'^{-1}(x_r)$ is constant with probability
one given $B_1$. But the expected value of $R'^{-1}(x_r)$ given $B_1$ is equal to
1, as (14) asserts and as may be easily verified from the definition of
$R'^{-1}(x_r)$. So, barring the exceptions provided for, I > 0, and the
demonstration of (7) is complete.

Before the observation, the probability that the probability given x
of whichever element of the partition actually obtains will be greater
than $\alpha$ is

(15) $\sum_i \beta(i) P(P(B_i \mid x) > \alpha \mid B_i)$,

where summation is confined to those i's for which $\beta(i) \ne 0$.
Application of (14) (extended to arbitrary pairs of i's) shows that the coefficients
of each $\beta(i)$ in the quantity (15), and therefore the quantity itself,
approaches 1 as n increases; provided only that no two distributions $\xi(x_r \mid i)$
and $\xi(x_r \mid i')$ are the same, if $\beta(i)$ and $\beta(i')$ are both different from 0.

† For the definition of this law, see, if necessary, p. 191 of Feller's book [F1].

To summarize informally, it has now been shown that, with the
observation of an abundance of relevant data, the person is almost
certain to become highly convinced of the truth, and it has also been shown
that he himself knows this to be the case.
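The approach to certainty can be watched numerically in a simple case. The sketch below assumes a hypothetical pair of hypotheses about a 0-or-1 observation, with equal priors, and computes exactly the probability, given the first hypothesis, that the posterior probability of that hypothesis exceeds 0.99 after n observations. The function name `prob_convinced` and all numbers are illustrative assumptions.

```python
from math import comb, log

# Hypothetical example: x_r is 0 or 1, with P(x_r = 1 | B_1) = 0.6 and
# P(x_r = 1 | B_2) = 0.4, and equal priors beta(1) = beta(2) = 0.5.
p1, p2 = 0.6, 0.4
beta1, beta2 = 0.5, 0.5

# Equation (11): I = E(log R'(x_r) | B_1), the mean log likelihood ratio.
I = p1 * log(p1 / p2) + (1 - p1) * log((1 - p1) / (1 - p2))

def prob_convinced(n, alpha=0.99):
    """P(P(B_1 | x) > alpha | B_1), summed exactly over the number y of 1's."""
    total = 0.0
    for y in range(n + 1):
        like1 = p1 ** y * (1 - p1) ** (n - y)    # P(one pattern with y 1's | B_1)
        like2 = p2 ** y * (1 - p2) ** (n - y)    # the same pattern given B_2
        posterior1 = beta1 * like1 / (beta1 * like1 + beta2 * like2)
        if posterior1 > alpha:
            total += comb(n, y) * like1          # all patterns with y 1's
    return total

results = {n: prob_convinced(n) for n in (10, 100, 400)}
print(I, results)
```

With ten observations the posterior cannot reach 0.99 at all, so the probability is zero; by a few hundred observations it is close to 1. The positive value of I is what drives this, as in (12) and (13).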

It may be remarked, for those familiar with certain theorems, that 
many refinements of (7) and its consequences could be worked out by 
application of the strong law of large numbers, the central limit
theorem, and the law of the iterated logarithm to $R'(x_r)$.

The quantity I is coming to be called the information of the
distribution of $x_r$ given $B_1$ with respect to the distribution of $x_r$ given $B_2$.
More generally, if P and Q are probability measures, confined (for
simplicity) to a finite set X with elements x; the information of P with
respect to Q is defined by

(16) $\sum_x P(x) \log \frac{P(x)}{Q(x)}$.
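Formula (16) is easy to compute for measures on a finite set. The following sketch is one way to do it; the treatment of zero probabilities (terms with P(x) = 0 are dropped, and the result is infinite when Q(x) = 0 but P(x) > 0) is a common convention assumed here rather than something stated in the text, and the measures P and Q are hypothetical.

```python
from math import log

def information(p, q):
    """Formula (16): sum over x of P(x) * log(P(x) / Q(x)).

    Assumed convention: terms with P(x) == 0 contribute nothing, and the
    result is infinite if Q(x) == 0 for some x with P(x) > 0.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return float("inf")
        total += px * log(px / qx)
    return total

# Hypothetical measures on a three-element set.
P = {"a": 0.5, "b": 0.4, "c": 0.1}
Q = {"a": 0.25, "b": 0.5, "c": 0.25}

print(information(P, Q), information(Q, P), information(P, P))
```

The information of a measure with respect to itself is zero, it is positive otherwise, and it is not symmetric in P and Q.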


This usage stems from work of Claude Shannon in communication
engineering, a good account of which is given in [S11]; and also from
independent work of Norbert Wiener in a related context [W10]. The ideas
of Shannon and of Wiener, though concerned with probability, seem
rather far from statistics. It is, therefore, something of an accident
that the term “information” coined by them should be not altogether
inappropriate in statistics. The situation is still further confused,
because, as long ago as 1925, R. A. Fisher emphasized an important
notion, which he called “information,” in connection with the theory of
estimation (Paper 11, Theory of statistical estimation in [F6]). At first
glance, Fisher’s notion seems quite different from that of Shannon and
Wiener, but, as a matter of fact, his is a limiting form of theirs. A
useful but rather technical exposition relating the several senses of
“information” is given by Kullback and Leibler [K15], and I return to the
topic in § 15.6.


7 Symmetric sequences of events


A problem often posed by statisticians is to estimate from a sequence
of observations the unknown probability p that repeated trials of some
sort are successful. On an objectivistic view, this problem is natural
and important, for on such a view the probability that a coin falls heads,
for example, is a property of the coin that can be determined by
experimentation with the coin and in no other way. But on a personalistic
view of probability, strictly interpreted, no probability is unknown to
the person concerned, or, at any rate, he can determine a probability 
only by interrogating himself, not by reference to the external world. 
This situation has been interpreted to imply that the personalistic 
view is wrong, or at any rate inadequate, because it apparently cannot 
even express one of the most natural and typical problems of statistics. 
Thus far in this book, I have not argued against the possibility of de- 
fining some useful notion of objective probability, but have contented 
myself with presenting a particular notion of personal probability. 
Therefore, at this point it might be tempting to seek a dualistic theory 
admitting both objective and personal probabilities in some kind of ar- 
ticulation with one another. De Finetti [D3] has shown, however, 
that it is not necessary to do so, that the notion of a coin with unknown 
probability p can be reinterpreted in terms of personal probability 
alone. 
The present section is devoted to outlining this development due to 
de Finetti. In the organization of the book as a whole, it plays no logi- 
cally essential part; it is, rather, a digression intended to give a clearer 
understanding of the notion of personal probability, especially in
relation to objectivistic views. The ideas presented here are but a
fragment of those on the same subject in [D2].
Let $x_r$ be a sequence of random variables taking only the values 0
and 1. The $x_r$'s are, to all intents and purposes, a sequence of events,
the rth of which is the event that $x_r(s) = 1$. To say that these events
are independent, each occurring with probability p, is to say that the
probability of any finite pattern, $x_1, \cdots, x_n$, initiating the sequence
$x_r(s)$ is given by the formula

(1) $P(x_r(s) = x_r;\ r = 1, \cdots, n \mid p) = p^y (1 - p)^{n - y}$,

where y is the number of 1's among the $x_r$'s for $r = 1, \cdots, n$.

Mixtures, in a certain sense, of sequences of random variables are
often of interest, as they already have been in the preceding section.
Suppose, to be explicit, that the world is partitioned by $B_i$ and that,
given $B_i$, the $x_r$'s are independent with $P(x_r(s) = 1 \mid B_i)$ having some
fixed value p(i). Then the unconditional probability of a particular
initial sequence is a mixture of the probabilities given by (1) thus:

(2) $P(x_r(s) = x_r;\ r = 1, \cdots, n) = \sum_i p(i)^y (1 - p(i))^{n - y} P(B_i)$.

It is natural to generalize (2) formally thus:

(3) $P(x_r(s) = x_r;\ r = 1, \cdots, n) = \int p^y (1 - p)^{n - y} \, dM(p)$,

where M is a probability measure on the real numbers in the interval
[0, 1].
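The finite mixture (2) can be computed directly. The sketch below assumes a hypothetical three-point mixture (the values p(i) and the weights P(B_i) are invented for illustration) and checks two things by enumeration: the probability of a pattern depends only on how many 1's it contains, and yet the individual trials are not independent under the mixture.

```python
from fractions import Fraction
from itertools import product

# Hypothetical mixture: the coin has heads-probability p(i) with weight P(B_i).
p = [Fraction(1, 4), Fraction(1, 2), Fraction(9, 10)]
weight = [Fraction(1, 5), Fraction(3, 5), Fraction(1, 5)]

def pattern_prob(pattern):
    """Equation (2): sum_i p(i)^y (1 - p(i))^(n - y) P(B_i)."""
    n, y = len(pattern), sum(pattern)
    return sum(w * q ** y * (1 - q) ** (n - y) for q, w in zip(p, weight))

n = 4
probs = {pat: pattern_prob(pat) for pat in product((0, 1), repeat=n)}

# Patterns with the same number of 1's receive the same probability ...
by_count = {}
for pat, pr in probs.items():
    by_count.setdefault(sum(pat), set()).add(pr)

# ... yet the trials are not independent under the mixture.
p_first = pattern_prob((1,))
p_first_two = pattern_prob((1, 1))

print(p_first, p_first_two, p_first ** 2)
```

A first success makes a second more probable here, because it shifts weight toward the larger values of p(i); this is the learning from experience discussed in the preceding section, expressed without any "unknown" probability.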

It is noteworthy that equation (3), understood to apply for every n,
is equivalent to the condition that the probability that every one of each
prescribed set of n of the x_r’s takes the value 1 is

(4)    $\int p^{n}\, dM(p).$

This follows by arithmetic induction from the obvious formula

(5)    $P(x_r(s) = x_r;\ r = 1, \dots, n)$
           $= P(x_r(s) = x_r;\ r = 1, \dots, n;\ x_{n+1}(s) = 0)$
           $\quad + P(x_r(s) = x_r;\ r = 1, \dots, n;\ x_{n+1}(s) = 1),$

which applies to any sequence of random variables taking on only the
values 0 and 1.
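
The induction from (5) can be checked numerically. The following Python sketch assumes, purely for illustration, a discrete mixing measure M given as (p, weight) pairs; the name MIXTURE and the three helper functions are hypothetical and do not come from the text. It computes the probability of a pattern once directly from (3) and once from the moments (4) alone, using the alternating sum that repeated application of (5) produces.

```python
from fractions import Fraction
from math import comb

# Hypothetical discrete mixing measure M: (p, weight) pairs, weights sum to 1.
MIXTURE = [(Fraction(1, 4), Fraction(1, 3)), (Fraction(3, 4), Fraction(2, 3))]


def moment(j, mixture=MIXTURE):
    """Expression (4): probability that j prescribed terms all equal 1."""
    return sum(w * p**j for p, w in mixture)


def pattern_prob_direct(y, n, mixture=MIXTURE):
    """Equation (3): probability of one given pattern with y ones among n terms."""
    return sum(w * p**y * (1 - p)**(n - y) for p, w in mixture)


def pattern_prob_from_moments(y, n, mixture=MIXTURE):
    """The same probability recovered from the moments (4) only.

    Rearranging (5) removes one prescribed zero at a time; doing so for all
    n - y zeros gives the alternating sum of C(n - y, k) * moment(y + k).
    """
    return sum((-1)**k * comb(n - y, k) * moment(y + k, mixture)
               for k in range(n - y + 1))
```

Exact rational arithmetic is used so that the two routes can be compared for equality rather than to within rounding error.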

Equation (3) can very well have an interpretation in such terms that 
the measure M is not merely an abstract probability measure, but is 
actually a personal probability. Thus, if p is a random variable that 
is (for a given person) distributed according to M, and, if for each p 
the conditional distribution of the x_r’s given p is independent, with
P(x_r(s) = 1) = p; then (3) obtains. Strictly speaking, the notion of
conditional probability as it occurs in the preceding sentence is used in 
a somewhat wider sense than has been defined in this book, for the 
probability of any particular p will typically be zero. At least for 
countably additive measures, the necessary extension of conditional 
probability and conditional expectation is presented by Kolmogoroff in 
[K7]; it is a concept of the greatest value in advanced mathematical 
statistics and in probability generally. 

However, in most contexts where objectivists speak of an unknown
probability p, there is, so far as an exclusively personalistic view of
probability is concerned, no unknown parameter that can play the role
of p in (3).


Examination of situations in which “unknown” probability is appealed
to, whether justifiably or not, shows that, from the personalistic
standpoint, they always refer to symmetric sequences of events in the
sense of the following definition. The sequence of random variables
x_r, taking only the values 0 and 1, is a symmetric† sequence, if and only
if the probability that any b of the x_r(s)’s equal 1 and any c other
x_r(s)’s equal 0 depends only on the integers b and c.

† De Finetti uses the French word for “equivalent.”

It is easy to verify that any mixture of independent sequences in the
sense of (3) is a symmetric sequence. De Finetti has discovered that 
the converse is also true. These conclusions can be formally summarized 
thus: 


THEOREM 1 A sequence of random variables x_r, taking only the
values 0 and 1, is symmetric, if and only if there exists a probability
measure M on the interval [0, 1] such that the probability that any
prescribed n of the x_r(s)’s equal 1 is given by (4). Two such measures, M
and M’, must be essentially the same,† in the sense that, if B is a
subinterval of [0, 1], then M(B) = M’(B).
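
The definition of symmetry can be tested mechanically on the first few terms of a sequence. The Python sketch below is an illustration under stated assumptions, not part of de Finetti's argument: the two-point mixture and the "sticky" Markov chain are hypothetical examples, and the check covers only reorderings of the first n terms, which is a necessary condition for symmetry rather than a proof of it.

```python
from fractions import Fraction
from itertools import permutations, product


def is_symmetric(prob, n):
    """True if prob(pattern) is the same for every reordering of each 0/1 pattern of length n."""
    for pattern in product((0, 1), repeat=n):
        base = prob(pattern)
        if any(prob(q) != base for q in set(permutations(pattern))):
            return False
    return True


def mixture_prob(pattern, mixture=((Fraction(1, 4), Fraction(1, 2)),
                                   (Fraction(3, 4), Fraction(1, 2)))):
    """Mixture of independent sequences as in (2), computed term by term."""
    total = Fraction(0)
    for p, weight in mixture:
        term = weight
        for x in pattern:
            term *= p if x == 1 else 1 - p
        total += term
    return total


def sticky_chain_prob(pattern, stay=Fraction(9, 10)):
    """Hypothetical Markov chain: fair first term, each later term repeats its predecessor with probability `stay`."""
    total = Fraction(1, 2)
    for prev, cur in zip(pattern, pattern[1:]):
        total *= stay if cur == prev else 1 - stay
    return total
```

With these definitions, is_symmetric(mixture_prob, 4) holds, while is_symmetric(sticky_chain_prob, 3) fails because the chain gives the patterns (1, 1, 0) and (1, 0, 1) different probabilities.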


Considering that de Finetti has published a proof of Theorem 1 in
[D2] based on the Fourier integral, that any proof of it must be rather
technical, and that the theorem is not the basis of any formal inference
later in this book, it seems best not to prove it here.‡

† If “probability measure” were here understood to mean a countably
additive measure on the Borel sets of [0, 1], the theorem would remain
true, and the essential uniqueness of M would become true uniqueness.

‡ Technical note: Theorem 1 can be proved very quickly and naturally by
applying the theory of the Hausdorff moment problem (pp. 8-9 of [S13])
to M, but this method does not seem to generalize readily.

It is Theorem 1 that makes it possible to express propositions
referring to unknown probabilities in purely personalistic terms. If, for
example, a statistician were to say, “I do not know the p of this coin,
but I am sure it is at most one half,” that would mean in personalistic
terms, “I regard the sequence of tosses of this coin as a symmetric
sequence, the measure M of which assigns unit measure to the interval
[0, 1/2].” This condition on M means in turn that for every n the
(personal) probability of n consecutive heads is at most 2^(-n), as is
easily verified. I do not insist that propositions couched in terms of a
fictitious unknown probability are bad, if understood as suggestive
abbreviations, but only that the meaningfulness of such propositions does
not constitute an inadequacy of the personalistic view of probability.

The mathematical concept of probability measure or, a trifle more
generally, bounded measure is fundamental to mathematics generally.
Probability measures, often under other names, are, therefore, employed
in many parts of pure and applied mathematics completely unrelated to
probability proper. For example, the distribution of mass in a not
necessarily rigid body is expressed by a bounded measure that tells how
much of the body is in each region of space. We must, therefore, not be
surprised if, even in studying probability itself, we come across some
probability measures used not to measure probability proper but only for
auxiliary purposes. In the event that p is not actually an unknown
parameter, the measure M presented by Theorem 1 seems at first sight to
be such a purely auxiliary measure, but, as a matter of fact, M does
measure certain interesting probabilities, at least approximately. For
example, letting


(6)    $\bar{x}_n = \frac{1}{n} \sum_{r=1}^{n} x_r,$

it can be shown that

(7)    $\lim_{n \to \infty} P(\bar{x}_n(s) \le \delta) = M(p \le \delta).$


In words, the person considers the average of any large number of fu- 
ture observations to be distributed approximately the way p is dis- 
tributed by M. This is an extension of the ordinary weak law of large 
numbers, proved in [D2] along with a corresponding extension of the 
strong law. 
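
The limit in (7) can be illustrated numerically. This is a minimal sketch assuming a hypothetical two-point mixing measure (the name MIXTURE and both functions are introduced here, not taken from the text); for such an M the left side of (7) is an exact mixture of binomial distribution functions, so no simulation is needed. The value of delta is chosen away from the two atoms of M.

```python
from math import comb, floor

# Hypothetical mixing measure M: (p, weight) pairs.
MIXTURE = [(0.25, 1 / 3), (0.75, 2 / 3)]


def prob_mean_at_most(delta, n, mixture=MIXTURE):
    """P(average of the first n terms <= delta) for a mixture of independent sequences."""
    k_max = floor(n * delta + 1e-12)   # largest count of 1's with average <= delta
    total = 0.0
    for p, w in mixture:
        total += w * sum(comb(n, k) * p**k * (1 - p)**(n - k)
                         for k in range(k_max + 1))
    return total


def mixing_mass_at_most(delta, mixture=MIXTURE):
    """M(p <= delta) for the discrete mixing measure."""
    return sum(w for p, w in mixture if p <= delta)
```

Comparing prob_mean_at_most(0.5, n) with mixing_mass_at_most(0.5) for increasing n shows the gap closing, as (7) asserts.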

If the first n terms of a symmetric sequence are observed, how does
the rest of the sequence appear to the person in the light of this
observation? In the first place, it also is a symmetric sequence but
generally of a structure different from that of the original sequence, as
may be shown thus: Let

(8)    $\pi(y, n - y) = P(x_r(s) = x_r;\ r = 1, \dots, n),$

as one may for a symmetric sequence. Then

(9)    $P(x_q(s) = x_q;\ q = n+1, \dots, n+m \mid x_r(s) = x_r;\ r = 1, \dots, n)$
           $= \dfrac{P(x_p(s) = x_p;\ p = 1, \dots, n+m)}{P(x_r(s) = x_r;\ r = 1, \dots, n)}$
           $= \dfrac{\pi(y + z,\ (n - y) + (m - z))}{\pi(y, n - y)},$

where z is the number of 1’s among the x_q’s, q = n + 1, ..., n + m.
Equation (9) shows that the sequence x_q, q > n, given that x_r(s) = x_r,
r = 1, ..., n, is a new symmetric sequence characterized by

(10)   $\pi'(z, m - z) = \dfrac{\pi(y + z,\ (n - y) + (m - z))}{\pi(y, n - y)}.$

The measure M’ associated with the new sequence is, according to
Theorem 1, essentially determined by the condition that

(11)   $\int p^{m}\, dM'(p) = \pi'(m, 0)$
           $= \dfrac{\pi(m + y,\ n - y)}{\pi(y, n - y)}$
           $= \dfrac{\int p^{m+y}(1 - p)^{n-y}\, dM(p)}{\pi(y, n - y)}$
           $= \int p^{m}\, \dfrac{p^{y}(1 - p)^{n-y}}{\pi(y, n - y)}\, dM(p).$


Equation (11) makes it plausible that, except for the slight ambiguity
permitted by Theorem 1, M’ is defined (for Borel sets B) by

(12)   $M'(B) = \dfrac{1}{\pi(y, n - y)} \int_B p^{y}(1 - p)^{n-y}\, dM(p),$

and this can in fact be demonstrated with some appeal to slightly
advanced methods pertaining to the Hausdorff moment problem (pp. 8-9
of [S13]).

It is noteworthy that, if M(B) = 0, then M’(B) = 0 also. In the 
event that p really is an unknown parameter, this means that, if the 
person is virtually certain that the true p is not in B, no amount of 
evidence can alter that opinion. 
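
For a discrete M, the passage from M to M’ is a reweighting that is easy to carry out exactly. The sketch below is illustrative only: the prior PRIOR, the observed counts, and the function names are hypothetical choices made here, and a measure is represented as a dictionary from values of p to their weights.

```python
from fractions import Fraction


def pattern_prob(y, n, mixture):
    """pi(y, n - y): probability of one given pattern with y ones among n terms, by (3)."""
    return sum(w * p**y * (1 - p)**(n - y) for p, w in mixture.items())


def updated_measure(y, n, mixture):
    """Equation (12) for a discrete M: reweight each p by p^y (1 - p)^(n - y), then normalize."""
    norm = pattern_prob(y, n, mixture)
    if norm == 0:
        raise ValueError("the observed pattern has probability zero under M")
    return {p: w * p**y * (1 - p)**(n - y) / norm for p, w in mixture.items()}


# Hypothetical prior: weight 1/2 on p = 1/4, weight 1/2 on p = 1/2, none on p = 3/4.
PRIOR = {Fraction(1, 4): Fraction(1, 2),
         Fraction(1, 2): Fraction(1, 2),
         Fraction(3, 4): Fraction(0)}

# Nine 1's observed among the first ten terms.
POSTERIOR = updated_measure(y=9, n=10, mixture=PRIOR)
```

Although nine 1's in ten terms would favor p = 3/4 over the other two values, the weight on p = 3/4 remains zero in POSTERIOR, because (12) can only rescale mass that M already assigns.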

Equation (12) shows that M’ is generally different from M. Indeed,
for fixed n > 1, M’ is clearly the same as M for every y for which
π(y, n − y) > 0, if and only if M assigns the measure 1 to some one
value of p. That is, the person regards evidence drawn from a symmetric
sequence as irrelevant to the future behavior of the sequence, if
and only if at the outset he regards the sequence not merely as
symmetric but also as independent.
It can be shown that the person regards it as highly probable that,
if he observes a sufficiently long segment of a symmetric sequence, the
continuation of the sequence will then be one for which the conditional
variance of p,

(13)   $\int p^{2}\, dM'(p) - \left[\int p\, dM'(p)\right]^{2},$

will be small. In the event that p is really an unknown parameter, this
implies that the person is very sure that after a long sequence of
observations he will assign nearly unit probability to the immediate
neighborhood of the value of p that actually obtains, a parallel to the
approach to certainty discussed in § 6.
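
The shrinking of the variance (13) can be seen in a small computation. This sketch assumes a hypothetical M that spreads its weight uniformly over an evenly spaced grid of values of p, and it looks only at one run of observations in which 30 percent of the terms are 1's; it illustrates the tendency and does not establish the probabilistic statement in the text.

```python
def posterior_variance(y, n, grid_size=21):
    """Variance (13) of p under M' after y ones in n terms, for a uniform grid prior on [0, 1]."""
    ps = [k / (grid_size - 1) for k in range(grid_size)]
    # Equation (12) with uniform prior weights, which cancel in the normalization.
    weights = [p**y * (1 - p)**(n - y) for p in ps]
    total = sum(weights)
    mean = sum(w * p for w, p in zip(weights, ps)) / total
    return sum(w * (p - mean)**2 for w, p in zip(weights, ps)) / total
```

Evaluating posterior_variance(3, 10), posterior_variance(30, 100), and posterior_variance(120, 400) in turn gives a decreasing sequence of values.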


CHAPTER 4 


Critical Comments 
on Personal Probability 


1 Introduction 


It is my tentative view that the concept of personal probability
introduced and illustrated in the preceding chapter is, except possibly
for slight modifications, the only probability concept essential to
science and other activities that call upon probability. I propose in this
chapter to discuss the shortcomings I see in that particular
personalistic view of probability, which, for brevity, shall here be
called simply “the personalistic view”; to point out briefly the
relationships between it and other views; to criticize other views in the
light of it; and to discuss the criticisms holders of other views have
raised, or may be expected to raise, against it.

From the standpoint of strict logical organization such critical
remarks are somewhat premature, because the personalistic view itself
insists that probability is concerned with consistent action in the face
of uncertainty. Consequently, until the theory of such action has been
completely outlined in later chapters, the view to be criticized cannot
be fully presented. Practically, however, [...] critical comments to the
one part of [...] even at the cost of some repetition. Thus, some of what
is to be said here has already been said in the introductory chapter, and
some of it will be said again.

† Much more extensive comparative material is given by Keynes [K4], by
Nagel [N1], and by Carnap [C1]. Koopman [K12] should also be mentioned
in this connection.

[...] personalistic view are to be discussed here, but [...]
have been accumulated. A less obvious, but I think no less important
and legitimate, function is to cast new light on the personalistic view, 
especially for those who already hold, or tend to hold, other views. 


2 Some shortcomings of the personalistic view 


I can answer, to my own satisfaction, some criticisms of the personal- 
istic view that have been brought to my attention. These points are 
discussed later in the chapter, but in this section I state and discuss 
as clearly as I can those that I find more difficult and confusing to 
answer. 

According to the personalistic view, the role of the mathematical
theory of probability is to enable the person using it to detect incon- 
sistencies in his own real or envisaged behavior. It is also understood 
that, having detected an inconsistency, he will remove it. An incon- 
sistency is typically removable in many different ways, among which 
the theory gives no guidance for choosing. Silence on this point does 
not seem altogether appropriate, so there may be room to improve the 
theory here. Consider an example: The person finds on interrogating 
himself about the possible outcome of tossing a particular coin five 
times that he considers each of the thirty-two possibilities equally 
probable, so each has for him the numerical probability 1/32. He also 
finds that he considers it more probable that there will be four or five 
heads in the five tosses than that the first two tosses will both be heads. 
Now, reference to the mathematical theory of probability soon shows 
the person that, if the probability of each of the thirty-two possibilities 
is 1/32, then the probability of four or five heads out of five is 6/32, 
and the probability that the first two tosses will be heads is 8/32, so 
the person has caught himself in an inconsistency. The theory does not 
tell him how to resolve the inconsistency; there are literally an infinite 
number of possibilities among which he must choose.
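
The arithmetic of this example can be reproduced by listing the thirty-two possibilities. The following Python sketch is only a check of the two figures quoted above; the names OUTCOMES, PROB, and probability are introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

OUTCOMES = list(product("HT", repeat=5))   # the thirty-two possible results of five tosses
PROB = Fraction(1, len(OUTCOMES))          # each judged equally probable, 1/32


def probability(event):
    """Probability of an event, given as a predicate on outcomes."""
    return sum(PROB for outcome in OUTCOMES if event(outcome))


four_or_five_heads = probability(lambda o: o.count("H") >= 4)
first_two_heads = probability(lambda o: o[0] == "H" and o[1] == "H")
```

Here four_or_five_heads equals 6/32 and first_two_heads equals 8/32, so a person who holds the first event more probable than the second, while also holding all thirty-two outcomes equally probable, is inconsistent.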

In this particular example, the choice that first comes to my mind, 
and I imagine to yours, is to hold fast to the position that all thirty-two 
possibilities are equally likely and to accept the implications of that 
position, including the implication that four or five heads out of five 
is less probable than two heads out of two. I do not think that there is
any justification for that choice implicit in the example as formally 
stated, but rather that in the sort of actual situation of which the ex- 
ample is a crude schematization there generally are considerations not 
incorporated in the example that do justify, or at any rate elicit, the 
choice. 


To approach the matter in a somewhat different way, there seem to
be some probability relations about which we feel relatively “sure” as
compared with others. When our opinions, as reflected in real or
envisaged action, are inconsistent, we sacrifice the unsure opinions to the
sure ones. The notion of “sure” and “unsure” introduced here is vague, 
and my complaint is precisely that neither the theory of personal proba- 
bility, as it is developed in this book, nor any other device known to me 
renders the notion less vague. There is some temptation to introduce 
probabilities of a second order so that the person would find himself 
saying such things as “the probability that B is more probable than C 
is greater than the probability that F is more probable than G.” But 
such a program seems to meet insurmountable difficulties. 

The first of these—pointed out to me by Max Woodbury—is this. 
If the primary probability of an event B were a random variable b 
with respect to secondary probability, then B would have a “composite” 
probability, by which I mean the (secondary) expectation of b.
Composite probability would then play the allegedly villainous role that
secondary probability was intended to obviate, and nothing would have
been accomplished.
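
Woodbury's point can be put in symbols. Writing Q for the secondary probability distribution of b on [0, 1] (the symbols Q and \bar{P} are labels introduced here for illustration, not notation from the text), the composite probability of B is a single number:

```latex
\bar{P}(B) \;=\; E_Q(b) \;=\; \int_0^1 b \, dQ(b).
```

Whatever Q may be, \bar{P}(B) is again one definite number in [0, 1], which is why it would take over the role of an ordinary first-order probability.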


Again, once second order probabilities are introduced, the
introduction of an endless hierarchy seems inescapable. Such a hierarchy
seems very difficult to interpret, and it seems at best to make the
theory less realistic, not more.

Finally, the objection concerning composite probability would seem
to apply, even if an endless hierarchy of higher order probabilities were
introduced. The composite probability of B would here be the limit
of a sequence of numbers, E_n(E_{n-1}(... E_2(P_1(B)) ...)), a limit that
could scarcely be postulated not to exist in any interpretable theory of
this sort. The reader may wish to evaluate for himself the arguments
in favor of such a hierarchy put forward by Reichenbach (Chapter 8,
[R2]), taking proper account of the differences between Reichenbach’s
overall view, and his mathematical theory, of probability on one hand
and, on the other, the personalistic view and measure-theoretic
mathematical theory that are the basis of my critique of higher order
probabilities.


The interplay between the “sure” and “unsure” [...] expressed by
de Finetti (p. 60, [D2]) thus: “[...] of a probability is not always
possible is [...]

[...] the probability of heads on the first toss of a certain penny is
1/2, it does
not at all follow that he considers the coin fair. He might, to take an 
extreme example, be convinced that the penny is a trick one that al- 
ways falls heads or always falls tails. 

Logic, to which the theory of personal probability can be closely par- 
alleled, is similarly incomplete. Thus, if my beliefs are inconsistent 
with each other, logic insists that I amend them, without telling me how 
to do so. This is not a derogatory criticism of logic but simply a part 
of the truism that logic alone is not a complete guide to life. Since the 
theory of personal probability is more complete than logic in some re- 
spects, it may be somewhat disappointing to find that it represents no 
improvement in the particular direction now in question. 

A second difficulty, perhaps closely associated with the first one, 
stems from the vagueness associated with judgments of the magnitude 
of personal probability. The postulates of personal probability imply 
that I can determine, to any degree of accuracy whatsoever, the proba- 
bility (for me) that the next president will be a Democrat. Now, it is 
manifest that I cannot really determine that number with great
accuracy, but only roughly. Since, as is widely recognized, all the
interesting and useful theories of modern science, for example, geometry,
relativity, quantum mechanics, Mendelism, and the theory of perfect
competition, are inexact; it may not at first sight seem disquieting that
the theory of personal probability should also be somewhat inexact. As
will immediately be explained, however, the theory of personal
probability cannot safely be compared with ordinary scientific theories in
this respect.
I am not familiar with any serious analysis of the notion that a theory
is only slightly inexact or is almost true, though philosophers of science
have perhaps presented some. Even if valid analyses of the notion
have been made, or are made in the future, for the ordinary theories of
science, it is not to be expected that those analyses will be immediately
applicable to the theory of personal probability, normatively
interpreted; because that theory is a code of consistency for the person
applying it, not a system of predictions about the world around him.

The difficulty experienced in § 2.6 with defining [...] seems
closely associated with the difficulty about vagueness [...].

Another difficulty with the theory of personal probability (or, more
properly, with that larger theory of the behavior of a person in the
face of uncertainty, of which the theory of personal probability is a
part) is that the statement of the theory is not yet necessarily complete.
Thus we shall in the next chapter come upon another proposition that
demands acceptance as a postulate, and, since even this leaves the
person a great deal of freedom, there is no telling when someone will come
upon still another postulate that clamors to be adjoined to the others. 
Strictly speaking, this is not so much an objection to the theory as a 
warning about what to expect of its future development. 


3 Connection with other views 


All views of probability are rather intimately connected with one an- 
other. For example, any necessary view can be regarded as an extreme 
personalistic view in which so many criteria of consistency have been 
invoked that there is no role left for the person’s individual judgment. 
Again, objectivistic views can be regarded as personalistic views ac- 
cording to which comparisons of probability can be made only for very 
special pairs of events, and then only according to such criteria that all 
(right-minded) people agree in their comparisons. 

From a different standpoint, personalistic views lie not between, but 
beside, necessary and objectivistic views; for both necessary and objec- 
tivistic views may, in contrast to personalistic views, be called objective 
in that they do not concern individual judgment. 


4 Criticism of other views 


It will throw some light on the personalistic view to say briefly how 
some other views seem to compare unfavorably with it. 

It is one of my fundamental tenets that any satisfactory account of 
probability must deal with the problem of action in the face of uncer- 
tainty. Indeed, almost everyone who seriously considers probability, 
especially if he has practical experience with statistics, does sooner or 
later deal with that problem, though often only tacitly. Even some 
personalistic views seem to me too remote from the problem of action, 
or decision. For example, de Finetti in [D2] gives two approaches to 
personal probability. Of these, one is almost exactly like the view 
sponsored here, except only that the notion “more probable than” is 
supposed to be intuitively evident to the person, without reference to 
any problem of decision. The other is more satisfactory in this re- 
spect, being couched in terms of betting behavior, but it seems to me 
a somewhat less satisfactory approach than the one sponsored here, be- 
cause it must assume either that the bets are for infinitesimal sums or— 
anticipating the language of the next chapter—that the utility of money 
is linear. The theory expressed by Koopman in [K9], [K10], and [K11]
and that expressed by Good in [G2] are both personalistic views that
tend to ignore decision, or at any rate keep it out of the foreground;
but the personalistic view expressed by Ramsey in [R1], like the one
sponsored here, takes decision as fundamental. If any necessary view
can be formulated at all, it might well be possible to formulate it in
terms of decision, but, so far as I know, the notion of decision has not 
appeared fundamental to the holders of any necessary view. It seems 
fair to say that objectivistic views, by their very nature, must in prin- 
ciple regard decision as secondary to probability, if relevant at all. 
Yet, the objectivist A. Wald has done more than anyone else to popu- 
larize the notion of decision. 

As has already been indicated, from the position of the personalistic 
view, there is no fundamental objection to the possibility of construct- 
ing a necessary view, but it is my impression that that possibility has 
not yet been realized, and, though unable to verbalize reasons, I con- 
jecture that the possibility is not real. Two of the most prominent en- 
thusiasts of necessary views are Keynes, represented by [K4], and Car- 
nap, who has begun in [C1] to state what he hopes will prove a satis- 
factory necessary (or nearly necessary) view of probability. Keynes 
indicated in the closing pages of [K4] that he was not fully satisfied 
that he had solved his problem and even suggested that some element 
of objectivistic views might have to be accepted to achieve a satisfac- 
tory theory, and Carnap regards [C1] as only a step toward the estab- 
lishment of a satisfactory necessary view, in the existence of which he 
declares confidence. That these men express any doubt at all about the 
possibility of narrowing a personalistic view to the point where it be- 
comes a necessary one, after such extensive and careful labor directed 
toward proving this possibility, speaks loudly for their integrity; at the 
same time it indicates that the task they have set themselves, if
possible at all, is not a light one.

Keynes, writing in 1921 of what are here called objectivistic views, 
complained, “The absence of a recent exposition of the logical basis of 
the frequency theory by any of its adherents has been a great
disadvantage to me in criticizing it.” (Chap. VIII, Sec. 17, of [K4]). I believe
that his complaint applies as aptly to my position today as to his then, 
though I cannot pretend to have combed the intervening literature 
with anything like the thoroughness Keynes himself would have
employed. Reichenbach, to be sure, presents in great detail an interesting
view that must be classified as objectivistic [R2], but it seems far
removed from those that dominate modern statistical theory and form
the main subject of the following discussion. Whatever objectivistic
views may be, they seem, to holders of necessary and personalistic
views alike, subject to two major lines of criticism. In the first place,
objectivistic views typically attach probability only to very special
events. Thus, on no ordinary objectivistic view would it be meaningful,
let alone true, to say that on the basis of the available evidence it
is very improbable, though not impossible, that France will become a
monarchy within the next decade. Many who hold objectivistic views 
admit that such everyday statements may have a meaning, but they 
insist, depending on the extremity of their positions, that that meaning 
is not relevant to mathematical concepts of probability or even to
science generally. The personalistic view claims, however, to analyze
such statements in terms of mathematical probability, and it considers
them important in science and other human activities.

Secondly, objectivistic views are, and I think fairly, charged with 
circularity. They are generally predicated on the existence in nature 
of processes that may, to a sufficient degree of approximation, be rep- 
resented by a purely mathematical object, namely an infinite sequence 
of independent events. This idealization is said, by the objectivists 
who rely on it, to be analogous to the treatment of the vague and
extended mark of a carpenter’s pencil as a geometrical point, which is so
fruitful in certain contexts. When it is pointed out to the objectivist
that he uses the very theory of probability in determining the quality
of the approximation to which he refers, he retorts that the applied
geometer, a fictitious character whose reputation for solidity in science
is unquestioned, likewise uses geometry in determining the quality of
his approximations. Let the geometer then be challenged, and he
replies with a threefold reference to experience, saying, “It is a common
experience that with sufficient experience one develops good judgment
in the use of geometry and thenceforth generally experiences success in
the predictions he bases on it.” “Now,” says the objectivist, “the
geometer’s answer is my answer.” But it seems to critics of objectivistic
views that, though the geometer may be entitled to make as many
allusions to experience as he pleases, the probabilist is not free to do so,
precisely because it is the business of the probabilist to analyze the
concept of experience. He, therefore, cannot properly support his position
by alluding to experience until he has analyzed that concept, though
he can, of course, allude to as many experiences as he wishes.

Two sorts of mixed views call for special comment here. 

First, some (among them Carnap [C1]; Koopman [K9], [K10], and
[K11]; and Nagel [N1]) hold that two probability concepts play a role
in inference, an objectivistic one and a personalistic or a necessary one.
This dualism is typically justified as necessary to the analysis of such 
a concept as that of a coin with unknown probability of falling heads. 
But, as § 3.7 explains, de Finetti has provided a satisfactory analysis 
on the basis of personal probability alone, 

Second, others (for example, van Dantzig [V1] and Féraud [F2]),
finding the conventional objectivistic views circular for the reasons I
have cited, try to break the circle by relatively isolated use of
subjective ideas. Very crudely, it seems to be their position that in any one
context it is allowable for a person to act as though some one event of 
sufficiently small (objective) probability, chosen at his discretion, were 
impossible. Quite apart from the relatively technical question of 
whether any consistent mixed view of this kind can be constructed, 
holders of personalistic and necessary views alike criticize them as un- 
necessarily timid, for they embrace subjective ideas, but only gingerly. 


5 The role of symmetry in probability 

An important and highly controversial question in the foundations 
of probability is whether and, if so, how symmetry considerations can 
determine the probabilities of at least some events. 

Symmetry considerations have always been important in the study 
of probability. Indeed, early work in probability was dominated by 
the notion of symmetry, for it was usually either concerned with, or di- 
rectly inspired by, symmetrical gambling apparatus such as dice or 
cards. To illustrate those classical problems, suppose that a gambler is 
offered several bets concerning the possible outcome of rolling three 
dice, where it is to be understood that refraining from any bets at all 
may be among the available “bets.” Which of the available bets 
should the gambler choose? Perhaps I distort history somewhat in in- 
sisting that early problems were framed in terms of choice among bets, 
for many, if not most, of them were framed in terms of equity, that is, 
they asked which of two players, if either, would have the advantage 
in a hypothetical bet. But, especially from the point of view of the 
earlier probabilists, such a question of equity is tantamount to a ques- 
tion of choice among bets, for to ask which of two “equal” betters has 
the advantage is to ask which of them has the preferable alternative, 
as was pointed out quite explicitly by D. Bernoulli in [B10].

In effect, the classical workers recommended the following solution
to the problem of three dice, with corresponding solutions to other
gambling problems:

1. Attach equal mathematical probabilities to each of the 216 (= 6^3)
possible outcomes of rolling the three dice. (There are 6^3 possibilities,
because the first, second, and third dice can each show any of six scores,
all combinations being possible.)

2. Under the mathematical probability established in Step 1, compute
the expected winnings (possibly negative) of the gambler for each
available bet.

3. Choose a bet that has the largest expected winnings among those
available.
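
The three steps above can be sketched in Python. The bets listed in BETS are hypothetical examples invented for this illustration (the text does not specify any particular bets), and the function names are likewise assumptions; only the three-step procedure itself is taken from the passage.

```python
from fractions import Fraction
from itertools import product

# Step 1: the 216 ordered triples of scores, each given equal probability.
OUTCOMES = list(product(range(1, 7), repeat=3))
PROB = Fraction(1, len(OUTCOMES))

# Hypothetical bets: each maps an outcome to the gambler's winnings in dollars.
BETS = {
    "no bet": lambda o: 0,
    "sum is at least 11, even money": lambda o: 1 if sum(o) >= 11 else -1,
    "at least one six, even money": lambda o: 1 if 6 in o else -1,
    "triple pays 40 to 1": lambda o: 40 if o[0] == o[1] == o[2] else -1,
}


def expected_winnings(bet):
    """Step 2: expectation of the winnings under the equal-probability measure."""
    return sum(PROB * bet(o) for o in OUTCOMES)


def best_bet(bets=BETS):
    """Step 3: name of a bet with the largest expected winnings (first one listed on ties)."""
    return max(bets, key=lambda name: expected_winnings(bets[name]))
```

Refraining from betting is included among the bets, as the text requires, so that a gambler offered only unfavorable bets is led to decline them all.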
At present it is appropriate to refrain from criticisms of the use
made of expected winnings until the next chapter and to concentrate 
discussion on the notion that the 216 possibilities should be considered 
equally probable, which can conveniently be done by drastically reduc- 
ing the class of bets considered to be available. Say, for definiteness, 
that the only bets to be considered are simply even-money bets of one 
dollar, that the triple of scores falls in a preassigned subset of the 216 
possibilities. When attention is focused on this restricted class of bets, 
the total recommendation is seen to imply that the probability measure 
defined in the first step of the recommendation be adopted as the
personal probability of the gambler. To put it differently, a gambler who
adopts the recommendation will hold the 216 possible outcomes equally
probable not only in some abstract sense, but also in the sense of
personal probability as defined in § 3.2.

The notion that the 216 possibilities should be regarded as equally
probable is familiar to everyone; for it is taken for granted wherever
gentlemen gamble as well as in the standard high-school algebra courses,
where it serves to illustrate the theory of combinations and permutations.

Traditionally, the equality of the probabilities was supposed to be
established by what was called the principle of insufficient reason,†
thus: Suppose that there is an argument leading to the conclusion that
one of the possible combinations of ordered scores, say {1, 2, 3}, is
more probable than some other, say {6, 3, 4}. Then the information
on which that hypothetical argument is based has such symmetry as
to permit a completely parallel, and therefore equally valid, argument
leading to the conclusion that {6, 3, 4} is more probable than {1, 2, 3}.
Therefore, it was asserted, the probabilities of all combinations must
be equal.

† Perhaps what I here call the principle of insufficient reason should be
called the principle of cogent reason. See Section 3 of [B15] for the
distinction involved.

The principle of insufficient reason has been and, I think, will
continue to be a most fertile idea in the theory of probability; but it is
not so simple as it may appear at first sight, and criticism has frequently
[...] Holders of necessary views typically [...] a rigorous basis by
modifying it [...] criticism. Holders of personalistic views [...]
regard the criticism as not altogether [...] apparatus in question, or
even with similar apparatus. Thus, attempts to use the principle, as I
have stated it, to prove that there is no such thing as a run
of luck at dice, as actually played, are invalid. The person may have 
had relevant experience, directly or vicariously, not only with gambling 
apparatus itself, but also with people who make and handle it, including 
cheaters. 

It is not always obvious what the symmetry of the information is in 
a situation in which one wishes to invoke the principle of insufficient 
reason. For example, d’Alembert, an otherwise great eighteenth-cen- 
tury mathematician, is supposed to have argued seriously that the prob- 
ability of obtaining at least one head in two tosses of a fair coin is 2/3 
rather than 3/4. (Cf. [T3], Art. 464.) Heads, as he said, might appear 
on the first toss, or, failing that, it might appear on the second, or, 
finally, might not appear on either. D’Alembert considered the three 
possibilities equally likely. 
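
The discrepancy between 2/3 and 3/4 can be checked by direct enumeration. The following is a minimal sketch, assuming only that the four ordered outcomes of two tosses are equally likely; the variable names are illustrative and not part of the text.

```python
from fractions import Fraction
from itertools import product

# The four ordered outcomes of two tosses, assumed equally likely.
outcomes = list(product("HT", repeat=2))
weight = Fraction(1, len(outcomes))

# Probability of at least one head, by counting favorable outcomes.
p_at_least_one_head = sum(weight for o in outcomes if "H" in o)

# d'Alembert's three cases, with the probability each actually carries.
dalembert_cases = {
    "head on first toss": sum(weight for o in outcomes if o[0] == "H"),
    "tail, then head": sum(weight for o in outcomes if o == ("T", "H")),
    "no head at all": sum(weight for o in outcomes if "H" not in o),
}

print(p_at_least_one_head)
print(dalembert_cases)
```

The three cases exhaust the possibilities but carry unequal weight, since the first of them covers two of the four ordered outcomes.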

It seems reasonable to suppose that, if the principle of insufficient
reason were formulated and applied with sufficient care, the conclusion
of d'Alembert would appear simply as a mistake. There are, however,
more serious examples. Suppose, to take a famous one, that it is known
of an urn only that it contains either two white balls, two black balls,
or a white ball and a black ball. The principle of insufficient reason has
been invoked to conclude that the three possibilities are equally
probable, so that in particular the probability of one white and one
black ball is concluded to be 1/3. But the principle has also been
applied to conclude that there are four equally probable possibilities,
namely, that the first ball is white and the second also, that the first
is white and the second black, etc. On that basis, the probability of
one white and one black ball is, of course, 1/2. Personally, I do not
try to arbitrate between the two conclusions but consider that the
existence of the pair of them reflects doubt on the notion that a
person’s knowledge relevant to any matter admits any full and precise
description in terms of propositions he knows to be true and others
about which he knows nothing.

Holders of personalistic views do not find the principle of
insufficient reason compelling, because they envisage the possibility
that a person may consider one event more probable than another without
having any compelling argument for his attitude. Viewed practically,
this position is closely associated with the first criticism of the
principle of insufficient reason, for the holder of a personalistic view
typically supposes that the person is under the influence of experience,
and possibly even biologically determined inheritance, that expresses
itself in his opinions, though not necessarily through compelling
argument.

Holders of personalistic views do see some truth in the principle of
insufficient reason, because they recognize that there are frequently par- 
titions of the world, associated with symmetrical-looking gambling ap- 
paratus and the like, that many and diverse people all consider (very 
nearly) uniform partitions. As was illustrated in the preceding sec- 
tion, we often feel more “sure” about probabilities derived from the 
judgment that such partitions are uniform than we do about others. 
Such partitions are, moreover, very important in that they provide 
some events the probability of which to diverse people is in agreement. 
Though the events concerned are often of no importance in themselves, 
agreement about them can, through the statistical invention of ran- 
domization, contribute to agreement about all sorts of issues open to 
empirical investigation. Widespread though the agreement about the 
near uniformity of some partitions is, holders of personalistic views
typically do not find the contexts in which such agreement obtains 
sufficiently definable to admit of expression in a postulate. 

Holders of purely objectivistic views see no sense at all in the original 
formulation of the principle of insufficient reason, for it uses
“probability” in a manner they consider meaningless. But they too see an
element of truth in the principle, which they consider to be established 
as a part of empirical physics. Thus, for example, they regard it as an 
experimental fact, admitting some explanation in terms of theoretical 
physics, that three dice manufactured with reasonable symmetry will 
exhibit each of the 216 possible patterns with nearly equal frequency, 
if repeatedly rolled with sufficient violence on a suitable surface. 

Holders of personalistic views agree that experiments or, more
generally, experiences determine to a large extent when people employ the
idea of insufficient reason. Thus, though experiments with gambling 
apparatus, quite apart from gambling itself, have a fascination that 
perhaps exceeds their real interest, such experiments are not altogether 
worthless. On the one hand, they provide strong evidence that a person
cannot expect to maintain a symmetrical attitude toward any piece of
apparatus with which he has had long experience, unless he is virtually
convinced at the outset that the possible [...] are equally probable and
independent [...] the more familiar and sometimes [...] probability,
long experiments with coins, dice, cards [...] always shown some bias,
and often some dependence from trial to trial. On the other hand (and
[...] has been shown that, with [...] its statistical equivalent, [...]
the dependence from trial [...]
that groups of very diverse people can be brought to agree that repeated
trials with certain apparatus are nearly uniform and nearly independent. 
Thus certain methods of obtaining random numbers and other outcomes 
of uniform and independent trials, which are vital to many sorts of 
experimentation, have justifiably found acceptance with the scientific 
public. A stimulating account of practical methods of obtaining random
numbers, and random samples generally, is given by Kendall in
Chapter 8 (Vol. I) of [K2].


6 How can science use a personalistic view of probability?


It is often argued by holders of necessary and objectivistic views alike 
that that ill-defined activity known as science or scientific method con- 
sists largely, if not exclusively, in finding out what is probably true, 
by criteria on which all reasonable men agree. The theory of proba- 
bility relevant to science, they therefore argue, ought to be a codifica- 
tion of universally acceptable criteria. Holders of necessary views say 
that, just as there is no room for dispute as to whether one proposition 
is logically implied by others, there can be no dispute as to the extent 
to which one proposition is partially implied by others that are thought 
of as evidence bearing on it, for the exponents of necessary views re- 
gard probability as a generalization of implication. Holders of objec- 
tivistic views say that, after appropriate observations, two reasonable 
people can no more disagree about the probability with which trials 
in a sequence of coin tosses are heads than they can disagree about the 
length of a stick after measuring it by suitable methods, for they con- 
sider probability an objective property of certain physical systems in 
the same sense that length is generally considered an objective property 
of other physical systems, small errors of measurement being
contemplated in both contexts. Neither the necessary nor the
objectivistic outlook leaves any room for personal differences; both,
therefore, look on any personalistic view of probability as, at best,
an attempt to predict some of the behavior of abnormal, or at any rate
unscientific, people.

I would reply that the personalistic view incorporates all the
universally acceptable criteria for reasonableness in judgment known to
me and that, when any criteria that may have been overlooked are brought
forward, they will be welcomed into the personalistic view. The criteria
incorporated in the personalistic view do not guarantee agreement on all
questions among all honest and freely communicating people, even in
principle. That incompleteness, if one will call it such, does not
distress me, for I think that at least some of the disagreement we see
around us is due neither to dishonesty, to errors in reasoning, nor to
friction in communication, though the harmful effects of the latter are
almost incapable of exaggeration.

As was mentioned in connection with symmetry, there are partitions 
that diverse people all consider nearly uniform, though not compelled 
to that agreement by any postulate of the theory of personal proba- 
bility. As has also been mentioned and as will be explained later (es- 
pecially in § 14.8), through the statistical invention of randomization, 
agreement about partitions pertaining to gambling apparatus of no im- 
portance in itself can be made to contribute to agreement in every 
part of empirical science. 

Another mechanism that brings people having some, but not all, 
opinions in common into more complete agreement was illustrated in 
§§ 3.6-7. Indeed, it was there shown that in certain contexts any two
opinions, provided that neither is extreme in a technical sense, are al- 
most sure to be brought very close to one another by a sufficiently 
large body of evidence. 

It has been countered, I believe, that, if experience systematically 
leads people with opinions originally different to hold a common opinion, 
then that common opinion, and it only, is the proper subject of
scientific probability theory. There are two inaccuracies in this argument.
In the first place, the conclusion of the personalistic view is not that 
evidence brings holders of different opinions to the same opinions, but 
rather to similar opinions. In the second place, it is typically true of 
any observational program, however extensive but prescribed in ad- 
vance, that there exist pairs of opinions, neither of which can be called 
extreme in any precisely defined sense, but which cannot be expected, 
either by their holders or any other person, to be brought into close 
agreement after the observational program. 

I have, at least once, heard it objected against the personalistic view
of probability that, according to that view, two people might be of 
different opinions, according as one is pessimistic and the other opti- 
mistic. I am not sure what position I would take in abstract discussion 
of whether that alleged property of personalistic views would be ob- 
jectionable, but I think it is clear from the formal definition of qualita- 
tive probability that the particular personalistic view sponsored here 
does not leave room for optimism and pessimism, however these traits 
be interpreted, to play any role in the person’s judgment of probabilities. 


CHAPTER 5 


Utility 


1 Introduction 


The postulates P4-6, introduced in Chapter 3, have already led to 
simplification of the relation < in so far as it applies to acts of a special 
but important form. Indeed, through the introduction of numerical 
probability, those special comparisons have been reduced to ordinary 
arithmetic comparison of numbers in such a way that many relations 
among acts are deducible by simple and systematic arithmetic calcula- 
tion. In this chapter it will be shown that the arithmetization of com- 
parison among acts can, with the introduction of one mild new postu- 
late, be extended to virtually all pairs of acts. 

This far-reaching arithmetization of comparison among acts is 
achieved by attaching a number U(f) to each consequence f in such a 
way that $f \le g$ if and only if the expected value of U(f) is numerically
less than or equal to that of U(g), provided only that the real-valued 
functions U(f) and U(g) are essentially bounded. The provision can 
fail to be met only if there exist acts that are, so to speak, distinctly 
preferable to any fixed reward or distinctly worse than any fixed punish- 
ment. 

A function U that thus arithmetizes the relation of preference among 
acts will be called a utility. It will be shown that the multiplicity of 
utilities is not complicated, every utility being simply related to every
other. I have chosen to use the name “utility” in preference to any
other, in spite of some unfortunate connotations this name has in
connection with economic theory, because it was adopted by von Neumann
and Morgenstern when in [V4] they revived the concept to which it
refers, in a most stimulating way. Their treatment has been of such
widespread interest that the introduction of a name other than “utility”
at the present time would cause more confusion than it could alleviate.

The next three sections are concerned with the technical exploration 
of the utility concept. I think readers interested in the details will find 
it best to read these sections twice as a unit, in the fashion I have been 
recommending for other material in which definitions and propositions 
are interlarded with proofs; others will be content with a cursory
reading, omitting proofs.

Taking advantage of the simplicity afforded by the introduction of 
utility, I try in §5 to make some progress with the problem, pointed 
out in § 2.5, of specifying criteria for the construction of “small worlds.” 

Finally, § 6 briefly reports the history of the utility idea. A separate 
critical section is not necessary, because the criticisms of the theory of 
utility known to me are incorporated conveniently into the historical 
section. 


2 Gambles 


Before discussing utility, it is expedient to establish certain facts, 
the first being that at least among a rather rich class of acts, namely 
acts confined with probability one to a finite number of consequences, 
preference depends only on the probability distribution of the conse- 
quences of the acts. 

THEOREM 1

Hyp. 1. $f_1, \cdots, f_n$ are $n$ elements of $F$.
     2. $\rho_1, \cdots, \rho_n$ are numbers such that $\sum \rho_i = 1$.
     3. $g$ and $h$ are acts such that

\[ P(g(s) = f_i) = P(h(s) = f_i) = \rho_i, \quad i = 1, \cdots, n. \]

Conc. $g \doteq h$.


PROOF. The theorem is obvious for $n = 1$. It will be proved by
induction, supposing henceforth that $n > 1$.

Let $B$ denote the intersection of the two events that $g(s) = f_n$ and
$h(s) \ne f_n$, and let $C$ denote the intersection of the two events
that $h(s) = f_n$ and $g(s) \ne f_n$. It is easy to see that
$P(B) = P(C)$. $C$ can be partitioned into $C_0, C_1, \cdots, C_{n-1}$,
where $C_0$ is a null event and $C_i$, $i = 1, \cdots, n - 1$, is the
intersection of $C$ with the event that $g(s) = f_i$. By repeated
application of Conclusion 7 of Theorem 3.3.3, $B$ can be partitioned
into events $B_0, B_1, \cdots, B_{n-1}$ such that $P(B_i) = P(C_i)$,
$i = 0, \cdots, n - 1$.

Let $g_0 = g$, and define $g_{i+1}$ step by step for
$i = 0, \cdots, n - 2$ thus:

\[
g_{i+1}(s) =
\begin{cases}
f_n & \text{for } s \in C_{i+1}, \\
f_{i+1} & \text{for } s \in B_{i+1}, \\
g_i(s) & \text{elsewhere.}
\end{cases}
\tag{1}
\]

It is easily seen from the facts of conditional probability that
$g_{i+1} \doteq g_i$ given $B_{i+1} \cup C_{i+1}$, and it is even more
obvious that $g_{i+1} \doteq g_i$ given $\sim(B_{i+1} \cup C_{i+1})$.
Therefore $g_{i+1} \doteq g_i$, so $g_{n-1} \doteq g$. Furthermore,
$P(g_{i+1}(s) = f_j) = P(g_i(s) = f_j) = \rho_j$, so
$P(g_{n-1}(s) = f_j) = \rho_j$, $j = 1, \cdots, n$. Thus $g_{n-1}$ is
not only equivalent to $g$ but also satisfies the hypothesis of the
theorem relative to $h$, so it will suffice to prove the theorem for
$g_{n-1}$ and $h$ in place of $g$ and $h$.

Now $g_{n-1}$ has been constructed to equal $f_n$ in $C$, except on a
null set. Therefore $g_{n-1} \doteq h$ given $C \cup D$, where $D$ is
the subset of $\sim C$ on which $g_{n-1} = h = f_n$.

It remains only to show that $g_{n-1} \doteq h$ given $\sim(C \cup D)$.
If $\sim(C \cup D)$ is null, that is true automatically; henceforth
concentrate on the less trivial situation. If $\sim(C \cup D)$ is not
null, then $\le$ given $\sim(C \cup D)$ satisfies all the postulates
assumed thus far, and therefore the consequences $f_1, \cdots, f_{n-1}$;
the numbers $\rho_i' = \rho_i/(1 - \rho_n)$, $i = 1, \cdots, n - 1$;
the acts $g_{n-1}$ and $h$; and the relation $\le$ given
$\sim(C \cup D)$ satisfy the hypothesis of the theorem for a case in
which it is supposed already to have been proved. ◆


In this chapter the notation $\sum \rho_i f_i$ will denote the class of
all acts $f$ for which there exist partitions $B_i$ of $S$ such that
$P(B_i) = \rho_i$ and $f(s) = f_i$ for $s \in B_i$. Here the $f_i$'s are
a finite sequence of consequences (not necessarily distinct), and the
$\rho_i$'s a corresponding sequence of non-negative real numbers such
that $\sum \rho_i = 1$. In view of Conclusion 7 of Theorem 3.3.3, such a
class of acts, which will in this chapter be referred to as a gamble and
denoted by $\mathbf{f}$, $\mathbf{g}$, $\mathbf{h}$, or the like, always
has at least one element. Theorem 1 says, in effect, that the person
regards all elements of any gamble as equivalent. To put it differently,
if the events $B_i$ of a partition have the probabilities $\rho_i$, and
if the act $f$ is such that the consequence $f_i$ will befall the person
in case $B_i$ occurs, then the value of $f$ is independent of how the
partition $B_i$ is chosen.

Gambles can be mixed, in a sense, to make new gambles, thus: Let
$\mathbf{f}_j$ be a finite sequence of gambles,

\[ \mathbf{f}_j = \sum_i \rho_{ij} f_{ij}, \tag{2} \]

and $\sigma_j$ a corresponding sequence of non-negative real numbers
such that $\sum \sigma_j = 1$. The mixture of the $\mathbf{f}_j$'s with
weights $\sigma_j$, denoted $\sum \sigma_j \mathbf{f}_j$, is defined by

\[
\sum_j \sigma_j \mathbf{f}_j
= \sum_j \sigma_j \Bigl\{ \sum_i \rho_{ij} f_{ij} \Bigr\}
= \sum_{i,j} (\sigma_j \rho_{ij}) f_{ij},
\tag{3}
\]

which is meaningful, the $f_{ij}$'s being consequences and the
$(\sigma_j \rho_{ij})$'s being numbers such that
$\sum (\sigma_j \rho_{ij}) = 1$. Such mixtures are exemplified by an
insurance policy in which the benefit is an annuity payable during the
life of the beneficiary, and by a lottery in which the prizes are tickets
in other lotteries.
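
The mixing rule (3) can be sketched in a few lines of code. This is only an illustration under stated assumptions: a gamble is represented as a dictionary from consequence to probability (so repeated consequences are merged, a convenience the text does not require), and the function name `mix` and the two example lotteries are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def mix(gambles, weights):
    """Return the mixture sum_j sigma_j f_j of finite gambles.

    Each gamble maps a consequence to its probability; each weight
    sigma_j is non-negative and the weights sum to 1, as in (3).
    """
    if any(w < 0 for w in weights) or sum(weights) != 1:
        raise ValueError("weights must be non-negative and sum to 1")
    mixed = defaultdict(Fraction)  # missing consequences start at 0
    for gamble, sigma in zip(gambles, weights):
        for consequence, rho in gamble.items():
            mixed[consequence] += sigma * rho  # the term sigma_j * rho_ij
    return dict(mixed)


# A lottery whose prizes are tickets in two other (hypothetical) lotteries.
lottery_a = {"car": Fraction(1, 10), "nothing": Fraction(9, 10)}
lottery_b = {"cash": Fraction(1, 2), "nothing": Fraction(1, 2)}
compound = mix([lottery_a, lottery_b], [Fraction(1, 4), Fraction(3, 4)])
print(compound)
```

Exact fractions are used so that the requirement that the combined weights sum to 1 can be checked without rounding error.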

In view of Theorem 1, it is natural to say that $\mathbf{f} \le g$ means
that, for every act $f$ in the class of acts corresponding to
$\mathbf{f}$, $f \le g$. Corresponding definitions are to be understood
for $f \le \mathbf{g}$, $\mathbf{f} \le \mathbf{g}$,
$\mathbf{f} < \mathbf{g}$, etc.


THEOREM 2 If $\mathbf{f}$, $\mathbf{g}$, and $\mathbf{h}$ are gambles,
and $0 < \rho < 1$; then
$\rho\mathbf{f} + (1 - \rho)\mathbf{h} \le \rho\mathbf{g} + (1 - \rho)\mathbf{h}$,
if and only if $\mathbf{f} \le \mathbf{g}$.


PROOF. Let $f$, $g$; $f_i$, $g_j$; and $B_i$, $C_j$ be acts,
consequences, and partitions such that $f$ and $g$ are among the acts
represented by $\mathbf{f}$ and $\mathbf{g}$, respectively, with
$f(s) = f_i$ for $s \in B_i$ and $g(s) = g_j$ for $s \in C_j$.

Construct $D_{ij} \subset B_i \cap C_j$ such that
$P(D_{ij}) = \rho P(B_i \cap C_j)$, and let $D = \bigcup D_{ij}$. Then
$P(D) = \rho$, $P(B_i \mid D) = P(B_i)$, and $P(C_j \mid D) = P(C_j)$.

What is to be proved is, in effect, that $f \le g$ given $D$, if and
only if $f \le g$. In view of Theorem 1 it is clear that whether that is
so or not for $f$ and $g$ does not depend on the particular choice of
$D$; so, with an obvious temporary extension of terminology, it is to be
proved that $\mathbf{f} \le \mathbf{g}$ given $\rho$, if and only if
$\mathbf{f} \le \mathbf{g}$.

If $\mathbf{f} \doteq \mathbf{g}$ given $\alpha$ for every
$0 < \alpha < 1$, there is nothing to prove. Otherwise it can be assumed
without loss of generality that, for some $\alpha_0$,
$\mathbf{f} < \mathbf{g}$ given $\alpha_0$.

In view of Theorem 2.7.2, if $\alpha + \beta < 1$,
$\mathbf{f} > \mathbf{g}$ given $\alpha$, and $\mathbf{f} > \mathbf{g}$
given $\beta$; then $\mathbf{f} > \mathbf{g}$ given $(\alpha + \beta)$,
and similarly $\mathbf{f} > \mathbf{g}$ given $\alpha/2$.

Making use of P6 and Theorem 2.7.2, it can easily be shown that, for
any $\alpha$ sufficiently close to $\alpha_0$,
$\mathbf{f} < \mathbf{g}$ given $\alpha$.

The preceding three paragraphs imply that, in the case at hand,
$\mathbf{f} < \mathbf{g}$ given $\alpha$ for every $\alpha$,
$0 < \alpha < 1$. ◆


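THEOREM 3 If $\mathbf{f} < \mathbf{g}$, and
$0 \le \sigma < \rho \le 1$, then
$\rho\mathbf{f} + (1 - \rho)\mathbf{g} < \sigma\mathbf{f} + (1 - \sigma)\mathbf{g}$.


PROOF. In view of the immediately verifiable identities,

\[
\begin{aligned}
\rho\mathbf{f} + (1 - \rho)\mathbf{g}
&= (\rho - \sigma)\mathbf{f} + [1 - (\rho - \sigma)] \times
\Bigl\{ \frac{\sigma}{1 - (\rho - \sigma)}\,\mathbf{f}
      + \frac{1 - \rho}{1 - (\rho - \sigma)}\,\mathbf{g} \Bigr\}, \\
\sigma\mathbf{f} + (1 - \sigma)\mathbf{g}
&= (\rho - \sigma)\mathbf{g} + [1 - (\rho - \sigma)] \times
\Bigl\{ \frac{\sigma}{1 - (\rho - \sigma)}\,\mathbf{f}
      + \frac{1 - \rho}{1 - (\rho - \sigma)}\,\mathbf{g} \Bigr\},
\end{aligned}
\tag{4}
\]

this theorem is a special case of Theorem 2; unless $\rho = 1$, and
$\sigma = 0$, in which case it is trivial. ◆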


THEOREM 4 If $\mathbf{f}_1 < \mathbf{f}_2$ and
$\mathbf{f}_1 \le \mathbf{g} \le \mathbf{f}_2$, then there is one and
only one $\rho$ such that
$\rho\mathbf{f}_1 + (1 - \rho)\mathbf{f}_2 \doteq \mathbf{g}$.


PROOF. It follows immediately from Theorem 3 and the principle of
the Dedekind cut† that there is one and only one $\rho_0$ such that

\[
\begin{aligned}
\sigma\mathbf{f}_1 + (1 - \sigma)\mathbf{f}_2 &< \mathbf{g},
\quad \text{if } \sigma > \rho_0, \\
\sigma\mathbf{f}_1 + (1 - \sigma)\mathbf{f}_2 &> \mathbf{g},
\quad \text{if } \sigma < \rho_0.
\end{aligned}
\tag{5}
\]

According to (5), no number, except possibly $\rho_0$, can satisfy the
equivalence demanded by the theorem.

Finally, using (5) and P6 (much as it was used in the proof of
Theorem 2), it follows that $\rho_0$ does indeed satisfy the
equivalence. ◆

† Cf., if necessary, any introduction to the theory of the real numbers
for explanation of this principle, e.g., Chapter II of [G |.


3 Utility, and preference among gambles

The idea of utility can most conveniently be introduced in connection
with gambles or, equivalently, acts that with probability one are
confined to a finite number of consequences, thus: A utility is a
function $U$ associating real numbers with consequences in such a way
that, if $\mathbf{f} = \sum \rho_i f_i$ and
$\mathbf{g} = \sum \sigma_j g_j$; then $\mathbf{f} \le \mathbf{g}$, if
and only if $\sum \rho_i U(f_i) \le \sum \sigma_j U(g_j)$. Writing
$U[\mathbf{f}]$ for $\sum \rho_i U(f_i)$, the condition takes the form
$U[\mathbf{f}] \le U[\mathbf{g}]$. Similarly, it is convenient to
understand that, for an act $f$,

\[ U[f] = E(U(f)). \tag{1} \]

In this notation the following obvious theorem gives a slightly different
characterization of utility.


THEOREM 1 A real-valued function of consequences, $U$, is a utility;
if and only if $f \le g$ is equivalent to $U[f] \le U[g]$, provided $f$
and $g$ are both with probability one confined to a finite set of
consequences.


Do the postulates thus far assumed guarantee that any utilities exist
at all? Can Theorem 1 be extended to an even wider class of acts?
Does a great diversity of utilities exist, or does the relation $\le$
practically determine the function $U$? These questions, here mentioned
in the order in which they most naturally arise, are manifestly of great
importance in understanding utility. For technical reasons, they will
be answered in a different order: the third followed by the first in this
section, and the second in the next section.

If there is a utility at all, there is surely more than one, because a 
utility plus a constant and a utility times a positive constant are also 
obviously utilities; thus: 


THEOREM 2 If $U$ is a utility, and $\rho$, $\sigma$ are real numbers
with $\rho > 0$; then $U' = \rho U + \sigma$ is also a utility.


COROLLARY 1 If there exists a utility, and if $f < g$; then there
exists a utility $U$ for which $U(f)$ and $U(g)$ are any preassigned
pair of numbers, provided $U(f) < U(g)$.


Theorem 2 says that any increasing linear function of a utility is a
utility. The next theorem says that, conversely, any two utilities are
necessarily increasing linear functions of one another.


THEOREM 3 If $U$ and $U'$ are utilities, there exist numbers $\rho$
and $\sigma$ such that $U' = \rho U + \sigma$, $\rho > 0$.


PROOF. The first step of the proof will be to demonstrate the
following identity for the two utilities $U$ and $U'$ and for any three
consequences $f$, $g$, $h$.

\[
\begin{vmatrix}
1 & 1 & 1 \\
U(f) & U(g) & U(h) \\
U'(f) & U'(g) & U'(h)
\end{vmatrix} = 0.
\tag{2}
\]

If any two of the consequences $f$, $g$, $h$ are equivalent, two columns
of the determinant in question are equal, and therefore the determinant
vanishes. It can be assumed, then, that no two of $f$, $g$, and $h$ are
equivalent; and there is no loss in generality, as may be seen by
permuting columns, in assuming $f < g < h$. Theorem 2.4 now permits the
conclusion that there is a $\rho$ such that
$\rho f + (1 - \rho)h \doteq g$. Therefore,

\[
\begin{aligned}
1 &= \rho \cdot 1 + (1 - \rho) \cdot 1, \\
U(g) &= \rho U(f) + (1 - \rho)U(h), \\
U'(g) &= \rho U'(f) + (1 - \rho)U'(h).
\end{aligned}
\tag{3}
\]

Thus the middle column of the determinant is linearly dependent on the
other two, so the determinant vanishes, as was asserted.

Now let $g$ and $h$ be any fixed pair of consequences such that
$g < h$, the existence of such a pair being assured by P5. Equation (2)
can be successively rewritten, where $f$ is an arbitrary consequence,
thus:

\[
U'(f)\bigl(U(h) - U(g)\bigr) - U(f)\bigl(U'(h) - U'(g)\bigr)
+ \bigl(U(g)U'(h) - U'(g)U(h)\bigr) = 0,
\tag{4}
\]

\[
U'(f) = \frac{U'(h) - U'(g)}{U(h) - U(g)}\,U(f)
- \frac{U(g)U'(h) - U'(g)U(h)}{U(h) - U(g)},
\tag{5}
\]

which proves the theorem; for $U'(h) - U'(g)$ and $U(h) - U(g)$ are
both positive. ◆


COROLLARY 2 If $U$ and $U'$ are utilities such that, for some $g < h$,
$U(g) = U'(g)$ and $U(h) = U'(h)$; then $U$ and $U'$ are the same, that
is, for every $f$, $U(f) = U'(f)$.


To summarize, if there is a utility at all, there are an infinite number, 
but the array of utilities is not complicated; for all can be generated 
from any one by increasing linear transformations. 
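
A small numerical check may make the two directions concrete. The sketch below assumes an arbitrary, purely illustrative utility on three consequences labelled "x", "y", "z", together with illustrative constants; it confirms that an increasing linear transformation leaves the ordering of a few gambles unchanged, and that the determinant of identity (2) vanishes for the pair of utilities.

```python
from fractions import Fraction
from itertools import permutations


def expected_utility(gamble, U):
    """U[f] for a finite gamble given as {consequence: probability}."""
    return sum(p * U[c] for c, p in gamble.items())


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion on the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Hypothetical utility values, chosen only for illustration.
U = {"x": Fraction(0), "y": Fraction(3), "z": Fraction(10)}
rho, sigma = Fraction(5, 2), Fraction(-7)  # rho > 0
U_prime = {c: rho * u + sigma for c, u in U.items()}

gambles = [
    {"x": Fraction(1, 2), "z": Fraction(1, 2)},
    {"y": Fraction(1)},
    {"x": Fraction(1, 4), "y": Fraction(1, 2), "z": Fraction(1, 4)},
]

# Theorem 2: U' orders every pair of gambles exactly as U does.
same_order = all(
    (expected_utility(a, U) <= expected_utility(b, U))
    == (expected_utility(a, U_prime) <= expected_utility(b, U_prime))
    for a, b in permutations(gambles, 2)
)

# Identity (2): the determinant built from U and U' vanishes.
determinant = det3([
    [1, 1, 1],
    [U["x"], U["y"], U["z"]],
    [U_prime["x"], U_prime["y"], U_prime["z"]],
])
print(same_order, determinant)
```

The check runs only in the easy direction, from a given linear relation to the identity; the substance of Theorem 3 is the converse, which no finite check establishes.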


Turn now to the question of existence. 
THEOREM 4 There exists a utility. 


PROOF. Von Neumann and Morgenstern prove essentially this theorem,
as well as the preceding one, in the appendix of [V4]. The following
proof is theirs, expressed, as the teacher used to say, in my own words. 

For this proof only, certain special nomenclature is introduced. A
set of gambles $F$ is convex; if and only if, for every $f, g \in F$ and
$\rho$, $0 \le \rho \le 1$, $\rho f + (1 - \rho)g \in F$. An interval
$I$ of gambles is the set of all gambles $f$ such that, for some fixed
$g$ and $h$ (which determine the interval), $g \le f \le h$. A
hyper-utility $V$ on a convex set $F$ is a real-valued function of the
gambles of $F$, such that $f \le g$, if and only if $V(f) \le V(g)$,
and such that $V(\rho f + (1 - \rho)g) = \rho V(f) + (1 - \rho)V(g)$.

The following remarks about this special nomenclature are obvious 
and will be repeatedly used in the proof, without explicit reference. 
The set of all gambles is convex. The intersection of two convex sets 
is convex. Every interval is convex. There is an interval containing 
any finite set of gambles. If there is a hyper-utility on the set of all 
gambles, it is a utility when confined to consequences. 

By the same method that led to the proofs of Theorems 2 and 3, 
if there is a hyper-utility on F containing g and h, with g < h, then there 
is one and only one hyper-utility $V$ on $F$ such that $V(g) = 0$ and
$V(h) = 1$.

If $I$ is the interval determined by $g < h$, then, according to Theorem
2.4, there is for every $f$ in $I$ a unique number, call it $V(f)$, such
that

\[ f \doteq (1 - V(f))g + V(f)h. \tag{6} \]

By repeated use of Theorem 2.2, it follows for any $f, f' \in I$ that

\[
\begin{aligned}
\rho f + (1 - \rho)f'
&\doteq \rho\{(1 - V(f))g + V(f)h\}
      + (1 - \rho)\{(1 - V(f'))g + V(f')h\} \\
&= \{1 - [\rho V(f) + (1 - \rho)V(f')]\}g
      + [\rho V(f) + (1 - \rho)V(f')]h,
\end{aligned}
\tag{7}
\]

so $V$ is a hyper-utility on the convex set $I$.
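
The construction of V(f) through (6) can be imitated numerically. The following is a hedged sketch, not the method of the proof: the person's preferences are simulated by a hidden, purely hypothetical rule (expected square root of a dollar amount), and V(f) is located by bisection using nothing but pairwise comparisons, in the spirit of the Dedekind cut of Theorem 2.4. All function names are illustrative.

```python
import math


def hidden_value(gamble):
    """Hypothetical stand-in for the person: expected sqrt of dollars."""
    return sum(p * math.sqrt(x) for x, p in gamble.items())


def not_preferred(a, b):
    """The only access the procedure has to preferences: is a <= b?"""
    return hidden_value(a) <= hidden_value(b)


def mix(a, b, rho):
    """The gamble rho*a + (1 - rho)*b."""
    out = {}
    for gamble, w in ((a, rho), (b, 1.0 - rho)):
        for x, p in gamble.items():
            out[x] = out.get(x, 0.0) + w * p
    return out


def V(f, g, h, tol=1e-12):
    """Bisect for the number v with f equivalent to (1 - v)g + v*h."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if not_preferred(mix(h, g, mid), f):  # (1 - mid)g + mid*h <= f
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


g, h = {0.0: 1.0}, {100.0: 1.0}  # endpoints of the interval, g < h
f1, f2 = {25.0: 1.0}, {81.0: 1.0}
v1, v2 = V(f1, g, h), V(f2, g, h)
v_mix = V(mix(f1, f2, 0.25), g, h)
print(v1, v2, v_mix)
```

With these illustrative numbers the value found for the mixture agrees, up to the tolerance, with 0.25 V(f1) + 0.75 V(f2), which is the linearity expressed by (7).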

From here on in this proof, let g, h be a fixed pair of consequences with 
g <h. Making use of the preceding two paragraphs, there is a unique 
hyper-utility assigning the values 0 and 1 to g and h, respectively, on 
any one interval containing g and h. The intersection of two such in- 
tervals is a convex set containing g and h, and on the intersection the 
hyper-utilities associated with the two intervals are both hyper-utilities 
attaching 0 and 1 to g and h, respectively; they must, therefore, be 
equal to one another on the intersection. 

Any gamble f is an element of some interval containing g and h. 
Let V(f) be the common value assigned to f by all the hyper-utilities 
that are defined on intervals containing f, g, and h and that assign the 
values 0 and 1 to g and h, respectively. Since there is always at least 
one such interval for any gamble f, the function V is defined for all 
gambles. 

The proof will be complete when it is shown that $V$ is a hyper-utility
for the convex set of all gambles. Let $f$ and $f'$ be any two gambles
and $\rho$ a number, $0 \le \rho \le 1$. There is an interval containing
$f$, $f'$, $g$, $h$, and $\rho f + (1 - \rho)f'$. In that interval the
function $V$ is a hyper-utility. Therefore
$V(\rho f + (1 - \rho)f') = \rho V(f) + (1 - \rho)V(f')$ and
$V(f) \le V(f')$, if and only if $f \le f'$. ◆


4 The extension of utility to more general acts 

The requirement that an act have only a finite number of consequences
may seem, from a practical point of view, almost no requirement at all.
To illustrate, the number of time intervals that might possibly be the
duration of a human life can be regarded as finite, if you agree that
the duration may as well be rounded to the nearest minute, or second, or
microsecond, and that there is almost no possibility of its exceeding a
thousand years. More generally, it is plausible that, no matter what set
of consequences is envisaged, each consequence can be practically
identified with some element of a suitably
chosen finite, though possibly enormous, subset. It might therefore 
seem of little or no importance to extend the concept of utility to acts 
having an infinite number of consequences. If that argument were 
valid, it could easily be extended to reach the conclusion that infinite 
sets are irrelevant to all practical affairs, and therefore to all parts of 
applied mathematics. But it is one of the most profound lessons of 
mathematical experience that infinite sets, tactfully handled, can lead 
to great simplification of situations that could, in principle, but only 
with enormous difficulty, be treated in terms of finite sets. How diffi- 
cult it would be to study geometry if one made at the outset the “sim- 
plifying assumption” that to all intents and purposes at most 10100? 
points in space can be discriminated from one another! Again, it is 
generally more convenient and fruitful to think of the annual cash in- 
come of an individual or firm as a continuous variable with an infinite 
number of possible values than as a discrete variable confined to some 
large finite number of values, even if it is known that the income must 
be some integral number of cents less, say, than 1072, 

One way to extend the concept of utility to acts with an infinite
number of consequences would be to postulate: If $U[f]$ and $U[g]$ both
exist (the values $+\infty$ and $-\infty$ being regarded as possible);
$f \le g$, if and only if $U[f] \le U[g]$. I see no serious objection to
making this assumption outright, though it might be complained that the
assumption is motivated more by general mathematical intuition and
experience than by intuitive standards of consistency among decisions,
which I have tried to take as my sole guide thus far. A statement almost
as strong as the one in question can, however, be derived on adjoining a
new postulate, P7, more in the spirit of P1-6. That rather technical
program will be carried out in the next several paragraphs. Those not
interested can safely skip to the paragraph following Corollary 1 on
page 80.

Suppose that every possible consequence of the act $g$ is at least as
attractive to the person as the act $f$ considered as a whole; then it
seems to me within the spirit of the sure-thing principle to conclude
that $f \le g$; the same might as fairly have been said for the relation
$\ge$, and also for the two relations $\le$ given $B$ and $\ge$ given
$B$. This idea is formalized in the following postulate, which,
according to the conventions of mathematical double-talk, is to be
interpreted as two propositions, one having $\le$ and the other $\ge$
throughout.


P7 If $f \le (\ge)\ g(s)$ given $B$ for every $s \in B$, then
$f \le (\ge)\ g$ given $B$.

Attention has been called to the mathematically useful fact that, if
P1-6 apply to a relation $\le$, then they also apply to any relation
$\le$ given $B$, provided $B$ is not null. It is obvious that the same
is true for P1-7, a fact that will be used often. It is also noteworthy
that P1-7 obviously imply the propositions that arise if in them every
instance of the sign $\le$ is replaced by $\ge$ and every instance of
$\ge$ is replaced by $\le$. Therefore in any deduction from P1-7 every
instance of the signs $\le$ and $\ge$ can be reversed to produce a
deduction that may be called
the symmetric dual of the original deduction. This remark, a legitimate 
child of the principle of insufficient reason, has not been important 
heretofore, because almost all deductions thus far made have been their 
own symmetric duals. Since that will not be so of some of the lemmas 
in the present section, much needless writing and thinking can be saved 
by agreeing at the outset that, once a result is proved, it and its sym- 
metric dual may be used as if both had been explicitly proved. 

Before going to work with P7, some may wish to see an example of 
a mathematical structure satisfying P1-6 but not satisfying P7. More- 
over, understanding of such an example will do much to clarify the uses 
to be made of P7. To construct the example, begin by letting S be a 
set carrying a finitely additive probability measure P under which S 
can be partitioned into subsets of arbitrarily small probability. Let 
the set of consequences be the half-open interval of numbers 0 ≤ f < 1. 
Let U(f) = f, U[f] = E(f), and 


(1)    V[f] = \lim_{\epsilon \to 0} P\{ f(s) \ge 1 - \epsilon \}. 


Since the probability in (1) decreases with ε, there is no question about 
the existence of the limit. Now let W[f] = U[f] + V[f], and define 
f < g to mean that W[f] ≤ W[g]. Checking postulates P1-6, it will 
be found that the < thus defined satisfies them all, and that what has 
here been called U(f) is indeed a utility for <. But if, for example, 
there is an f such that U[f] = V[f] = 1/2, P7 is violated, as can be seen 
by comparing f to the act that, for each s, takes as value the maximum 
of 1/2 and f(s). Whether there can be such an f may, so far as I know, 
depend on the choice of S and P. But, if the positive integers are taken 
as S, and P is so chosen that though the probability of any one integer 
is 0 the probability of the set of even integers is 1/2, a possibility as- 
sured by the note to Section 3 of Chapter II on p. 231 of [B4], the func- 
tion equal to 0 at the odd integers and equal to (1 — 1/n) at each even 
n is such an f. Finite, as opposed to countable, additivity seems to be 
essential to this example; perhaps, if the theory were worked out in a 
countably additive spirit from the start, little or no counterpart of P7 
would be necessary. 
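
For the particular f just described, the violation of P7 can be verified by direct calculation. The following sketch writes g for the act taking the value max(1/2, f(s)) and c for an arbitrary constant act; both labels are introduced here only for illustration. It uses nothing beyond the facts that finite sets of integers have probability 0 and that the even integers have probability 1/2.

```latex
\begin{align*}
U[f] &= E(f) = \tfrac12, &
V[f] &= \lim_{\epsilon \to 0} P\{\, n \text{ even},\ 1 - 1/n \ge 1 - \epsilon \,\} = \tfrac12, &
W[f] &= 1, \\
U[c] &= c, &
V[c] &= 0, &
W[c] &= c < 1 \quad (0 \le c < 1), \\
U[g] &= \tfrac12 \cdot \tfrac12 + \tfrac12 \cdot 1 = \tfrac34, &
V[g] &= \tfrac12, &
W[g] &= \tfrac54 .
\end{align*}
```

Thus W[f] exceeds W[g(s)] for every s, so f is at least as attractive as every consequence of g, and P7 would require that f be at least as attractive as g; yet W[f] = 1 < 5/4 = W[g].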

Several lemmas depending on P7 are now to be proved preparatory 
to proving that U[f] governs preference for a very large class of acts. 
It is to be understood throughout the section that U is any fixed utility. 
The truth of each lemma is intuitively clear, in the sense that each could 
justifiably be accepted as a postulate if need be. Since they are also 
easy to prove and of secondary interest, condensed proofs will suffice. 


LEMMA 1 If, for every consequence h, f < h, and g < h; then f = g. 


PROOF. Consider in the light of P7 that f < g(s) and g < f(s) for 
every s. ◆ 

LEMMA 2 If there exists a consequence f_0 such that f < f_0, and if 
U(f(s)) ≤ U_0 for every s, then there exists a gamble g such that f < g 
and U[g] ≤ U_0. 


PROOF. If U(f_0) ≤ U_0, then g can be taken to consist of f_0 alone. 
Otherwise, let f_1 be any consequence such that U(f_1) ≤ U_0 and let g 
be the unique mixture of f_0 and f_1 such that U(g) = U_0. ◆ 


LEMMA 3 
HYP. 1. The B_i's, i = 1, ..., n, are a partition, and the U_i's are 
corresponding numbers. 

2. f is an act such that U(f(s)) ≤ U_i for s ∈ B_i. 

3. f* is a gamble such that f* < f. 


CONCL.    U[f*] ≤ Σ U_i P(B_i). 


PROOF. If the lemma were false, it would be false even for some f* < f. 
Then it may be assumed, modifying f if need be by means of P6 and 
Lemma 1, that there exists for each i an f_i such that f < f_i given B_i. 
Now, in view of Lemma 2, there exists for each i a g_i such that f < g_i 
given B_i and U[g_i] ≤ U_i. Let g = Σ P(B_i) g_i and observe that f* < 
f < g. Therefore, U[f*] < U[g] = Σ P(B_i) U(g_i) ≤ Σ P(B_i) U_i. ◆ 

An act will be called bounded if its utility is, according to ordinary 
mathematical usage, an essentially bounded random variable; the no- 
tion is put in a more formal and self-contained way as follows: A bounded 
act is an act f such that, for some two numbers U_0 and U_1, 
P{U_0 ≤ U(f(s)) ≤ U_1} = 1. The definition is clearly not dependent on the 
choice of U. 


THEOREM 1 If f and g are bounded, then f < g, if and only if 
U[f] ≤ U[g]. 

PROOF. If there exist g and h such that g < f < h, then there is, 
by Theorem 2.4, a mixture f* of g and h such that f* = f. The null event 
on which U(f(s)) is not between U_0 and U_1 may as well be disregarded; 
the rest can be partitioned into n + 1 events B_i defined by the condition 
that s ∈ B_i if and only if V_{i-1} ≤ U(f(s)) < V_i, i = 1, ..., n + 1, 
where 


(2)    V_i = (1 - i/n) U_0 + (i/n) U_1,    i = 0, \dots, n + 1. 


Applying Lemma 3 and its symmetric dual, 


(3)    \sum_i V_{i-1} P(B_i) \le U[f^*] \le \sum_i V_i P(B_i). 

Similarly, according to Exercise 3 of Appendix 1, 

(4)    \sum_i V_{i-1} P(B_i) \le U[f] \le \sum_i V_i P(B_i). 

Therefore 

(5)    | U[f^*] - U[f] | \le \sum_i (V_i - V_{i-1}) P(B_i) = (U_1 - U_0)/n, 


whence U[f*] = U[f]. 
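
The bracketing in (3) through (5) can be illustrated numerically. What follows is only a sketch under simplifying assumptions: the state space is finite, the utilities and probabilities are randomly generated, and the function name `bracket_sums` is invented for this illustration. It slices the range from U_0 to U_1 into the levels V_i of (2) and shows that the lower and upper sums trap the expected utility in an interval of width (U_1 - U_0)/n.

```python
import random


def bracket_sums(utilities, probs, u0, u1, n):
    """Lower and upper sums, as in (3) and (4), for a bounded act on a finite state space.

    The levels are V_i = u0 + i * (u1 - u0) / n, which equals (2); a state with
    utility u is assigned to the slice i for which V_{i-1} <= u < V_i.
    """
    width = (u1 - u0) / n
    lower = upper = 0.0
    for u, p in zip(utilities, probs):
        i = min(int((u - u0) / width) + 1, n + 1)  # slice index in 1..n+1
        lower += (u0 + (i - 1) * width) * p
        upper += (u0 + i * width) * p
    return lower, upper


# Hypothetical data: 1000 states with utilities between u0 and u1.
random.seed(0)
u0, u1 = 2.0, 5.0
utilities = [random.uniform(u0, u1) for _ in range(1000)]
weights = [random.random() for _ in utilities]
total_weight = sum(weights)
probs = [w / total_weight for w in weights]
expected = sum(u * p for u, p in zip(utilities, probs))

for n in (1, 10, 100, 1000):
    lower, upper = bracket_sums(utilities, probs, u0, u1, n)
    print(n, lower, expected, upper, upper - lower)
```

As n grows the printed gap shrinks like (U_1 - U_0)/n, which is the content of (5).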

To consider the remaining case, suppose that the bounded act f ex- 
ceeds (is exceeded by) every consequence; call it for the moment big 
(little). According to Lemma 1, all big (and, dually, all little) acts are 
equivalent to one another. Furthermore, it is, for example, easily seen 
that, if an act is big, then for ε > 0, 


(6)    P\{ U(f(s)) > \sup U(f) - \epsilon \} = 1. 


(Some may be more familiar with the notation “LUB” and “GLB,” 
read “least upper bound” and “greatest lower bound,” than with the 
corresponding “sup” and “inf,” read “supremum” and “infimum.” If 
even these older terms are not familiar, see Exercise 4 of Appendix 2.) 
Therefore, if there are big (little) acts, they all have the same expected 
utility, namely sup U(f) (inf U(f)). 

Suppose now that f < g. It is possible that f and g are both little; 
that f is little, and g is equivalent to some gamble; that f is little and 
g big; that f and g are each equivalent to some gamble; that f is equiva- 
lent to some gamble, and g is big; or, finally, that they are both big. 
In each of these cases, a simple argument shows that U[f] ≤ U[g]. 
The converse arguments are similar. ◆ 


COROLLARY 1 If f and g are bounded, and P(B) > 0, then f < g 
given B, if and only if E(U(f) − U(g) | B) ≤ 0. 


It would be possible to explore unbounded acts for which expected 
utility exists to see whether expected utility governs preferences among 
even such acts under postulates P1-7 or under some extension of them. 
I do not think, however, that the question is sufficiently interesting to 
warrant attention here, especially since there is some reason, first stated 
by Gabriel Cramer in a letter partially reproduced in [B10], to postulate 
that there are upper and lower bounds to utility, in which case all acts 
would necessarily be bounded. 

Even without P7, the postulates imply, in the following sense, that 
no gamble has infinite or minus infinite utility. 

An act f has infinite (minus infinite) utility; if and only if, for some 
g < (>) h and for every ε > 0, there is a B with P(B) < ε and such 
that the act equal to f on B and to g on ~B exceeds (is exceeded by) h. 
A gamble or a consequence would be said to have infinite (minus in- 
finite) utility, if one of the acts corresponding to it had infinite (minus 
infinite) utility. 

Indeed, Theorem 2.4, a deduction from P1-6, obviously implies that 
there are no infinite or minus infinite gambles or consequences. It 
may, however, be mentioned that Pascal held that, in just the sense 
at hand, salvation is an infinite consequence ([P2], pp. 189-191). Again, 
it is often said, in effect, that the utility to a person of immediate death 
is a consequence of minus infinite utility, but casual observation shows 
that this is not true of anyone—at least not of anyone who would cross 
the street to greet a friend. In the same vein, medicine often gives lip 
service to the idea that the death of a patient is of minus infinite utility, 
and, of course, doctors do go to great lengths to keep their patients 
alive; but a doctor who took the idea too seriously would make a nui- 
sance of himself and soon find himself with no patients to treasure. 

If the utility of consequences is unbounded, say from above,† then, 
even in the presence of P1-7, acts (though not gambles) of infinite 
utility can easily be constructed. My personal feeling is that, theo- 
logical questions aside, there are no acts of infinite or minus infinite 
utility, and that one might reasonably so postulate, which would amount 
to assuming utility to be bounded. 

Justifiable though it might be, that assumption would entail a cer- 
tain mathematical awkwardness in many practical contexts. For ex- 
ample, as will be discussed at greater length in Chapter 15, it sometimes 
seems reasonable to suppose that the penalty for acting as though a 
particular unknown number were \hat{\mu} instead of its true value, \mu, is 
proportional to \delta^2 = (\mu - \hat{\mu})^2. But, if the possible values of \mu are 
unbounded, then so are the possible values of \delta^2, so utility is here 
taken to be unbounded. On close scrutiny of such an example one always 
finds that it is not really reasonable to assume the penalty even roughly 
proportional to \delta^2 for large values of \delta^2, but rather that large values 
are so improbable that the error made in misappraising the penalty associated 
with them is negligible compared to the saving in simplicity resulting 
from the misappraisal. If the assumption of bounded utility were made 
part of the theory of personal probability, then any example in which 
unbounded utility is used for mathematical simplicity would be in con- 
tradiction to the postulates. I propose, therefore, not to assume bounded 
utility formally, but to remember that problems involving unbounded 
utility are to be handled cautiously. 

† That is, if, for every V, there is a consequence f such that V < U(f). This 
manner of speaking is permissible; because in view of Theorem 3.3, if one utility is 
bounded, all are. 

To take stock of the chapter thus far, utility having been established, 
it is now superfluous to consider that consequences may be of all sorts, 
since the postulates imply that in virtually every context a consequence 
is adequately characterized by its utility, some one utility function 
having been chosen from the linear family of possibilities. Therefore, 
unless the contrary is clearly indicated, f, g, and h will henceforth mean 
not exactly consequences in the sense used to date, but rather real 
numbers measuring utility in units to be called utiles. Correspondingly, 
an act f will henceforth be understood to be a real-valued random varia- 
ble. The entire theory of preference, at least for bounded acts, can 
now be summarized by the following résumé: 


R    f < g given B, if and only if P(B) = 0, or E(f − g | B) ≤ 0. 


From now on, though not formulated as a postulate, it is to be assumed 
without further quibbling that R holds, provided only that E(f) and 
E(g) exist and are finite; no attempt will be made to compare acts for 
which the expected value does not exist or is infinite. 
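
The résumé R reduces conditional preference to a single computation, which can be sketched for a finite set of states. The code below is an illustration only: the state names, the probabilities, the two acts, and the function names are all hypothetical, and utilities are taken to be exact rationals measured in utiles.

```python
from fractions import Fraction


def conditional_expectation(act, prob, event):
    """E(act | event) on a finite state space; prob maps each state to its probability."""
    p_event = sum(prob[s] for s in event)
    if p_event == 0:
        raise ValueError("conditioning event has probability zero")
    return sum(act[s] * prob[s] for s in event) / p_event


def not_preferred_given(f, g, prob, event):
    """R: f is not preferred to g given B iff P(B) = 0 or E(f - g | B) <= 0."""
    if sum(prob[s] for s in event) == 0:
        return True
    difference = {s: f[s] - g[s] for s in prob}
    return conditional_expectation(difference, prob, event) <= 0


# Hypothetical states and acts; "snow" is a null state.
prob = {"rain": Fraction(1, 4), "cloud": Fraction(1, 4),
        "sun": Fraction(1, 2), "snow": Fraction(0)}
f = {"rain": 0, "cloud": 2, "sun": 4, "snow": 100}
g = {"rain": 3, "cloud": 3, "sun": 3, "snow": 0}

print(not_preferred_given(f, g, prob, set(prob)))   # comparison given the whole world
print(not_preferred_given(f, g, prob, {"sun"}))     # comparison given "sun"
print(not_preferred_given(f, g, prob, {"snow"}))    # null event: always True
```

Note that the null event makes every act not preferred to every other, in agreement with the first clause of R.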

If a person is free to decide among a set F of acts, he will presumably 
choose one the expectation of which is v(F), where 


(7)    v(F) = \sup_{f \in F} E(f), 


provided that such a one exists. This provision must be mentioned, 
even though a set F for which v(F) = ∞ will, by convention, not be 
considered to give rise to a valid decision problem; for, if F is infinite in 
number, there may be no act in F with expectation quite as great as 
v(F). Nonetheless, v(F) may, in a sense, be regarded as the value or 
utility of the set of acts F, as is discussed in the penultimate paragraph 
of § 6.5. 


5 Small worlds 


Allusion was made in the penultimate paragraph of § 2.5 to the prac- 
tical necessity of confining attention to, or isolating, relatively simple 
situations in almost all applications of the theory of decision developed 
in this book. As was mentioned there, I find it difficult to say with 
any completeness how such isolated situations are actually arrived at 
and justified. The purpose of the present section is to take some steps 
toward the solution of that problem or, at any rate, to set the problem 
forth as clearly as I can. This section, though important for a critical 
evaluation of the thesis of this book, is not essential to a casual reading. 

Making an extreme idealization, which has in principle guided the 
whole argument of this book thus far, a person has only one decision 
to make in his whole life. He must, namely, decide how to live, and 
this he might in principle do once and for all. Though many, like my- 
self, have found the concept of overall decision stimulating, it is cer- 
tainly highly unrealistic and in many contexts unwieldy.† Any claim 
to realism made by this book—or indeed by almost any theory of per- 
sonal decision of which I know—is predicated on the idea that some of 
the individual decision situations into which actual people tend to sub- 
divide the single grand decision do recapitulate in microcosm the mech- 
anism of the idealized grand decision. One application of the theory 
of utility to overall decisions has, however, been attempted by Milton 
Friedman in [F11]. 

The problem of this section is to say as clearly as possible what con- 
stitutes a satisfactory isolated decision situation. The general method 
of attack I propose to follow, for want of a better one, is to talk in terms 
of the grand situation—tongue in cheek—and in those terms to analyze 
and discuss isolated decision situations. I hope you will be able to 
agree, as the discussion proceeds, that I do not lean too heavily on the 
concept of the grand decision situation. 

Consider a simple example. Jones is faced with the decision whether 
to buy a certain sedan for a thousand dollars, a certain convertible also 
for a thousand dollars, or to buy neither and continue carless. The 
simplest analysis, and the one generally assumed, is that Jones is de- 
ciding between three definite and sure enjoyments, that of the sedan, 
the convertible, or the thousand dollars. Chance and uncertainty are 
considered to have nothing to do with the situation. This simple anal- 
ysis may well be appropriate in some contexts; however, it is not diffi- 
cult to recognize that Jones must in fact take account of many uncer- 
tain future possibilities in actually making his choice. The relative 
fragility of the convertible will be compensated only if Jones’s hope to 
arrange a long vacation in a warm and scenic part of the country ac- 
tually materializes; Jones would not buy a car at all if he thought it 
likely that he would immediately be faced by a financial emergency 
arising out of the sickness of himself or of some member of his family; 
he would be glad to put the money into a car, or almost any durable 
goods, if he feared extensive inflation. This brings out the fact that 
what are often thought of as consequences (that is, sure experiences of 
the deciding person) in isolated decision situations typically are in re- 
ality highly uncertain. Indeed, in the final analysis, a consequence is 
an idealization that can perhaps never be well approximated. I there- 
fore suggest that we must expect acts with actually uncertain conse- 
quences to play the role of sure consequences in typical isolated decision 
situations. 

† Unrealistic though the concept is, it would be a mistake, arising out of elliptical 
presentation, to suppose that the concept predicates the choice of a complete lifelong 
policy by new-born babies. If a person ever reached such a level of maturity 
as to be able to make a lifelong choice for his life from that time on, he would then 
become a person to whom the concept could be literally applied. 

Suppose now, to elaborate the example, that Jones is presented with 
a choice between tickets in several different lotteries such that, which- 
ever he chooses and whatever tickets are drawn, he will win either 
nothing, the sedan, the convertible, or a thousand dollars. None of 
these four consequences—not even “nothing”—is actually a sure con- 
sequence in the strict sense, as I think you will now understand. I 
propose to analyze Jones’s present decision situation in terms of a 
“small world.” The more colloquial Greek word, microcosm, will be 
reserved for a special kind of small world to be described later. To de- 
scribe the state of the small world is to say which prize is associated 
with each of the tickets offered to Jones. The small-world acts actually 
available to Jones are acceptance of one or another of the tickets. 
The generic small-world act is an arbitrary function taking as its value 
one of the four small-world consequences according to which small- 
world state obtains. 

It will be noticed that the small-world states are in fact events in 
the grand world, that indeed they constitute a partition of the grand 
world. If there are an infinite number of small-world states, as indeed 
there must be, if the small world is to satisfy the postulates P1-7, then 
the partition in question becomes an infinite partition.† These con- 
siderations lead to the following technical definitions. 

Let the grand world S be, as always, a set with elements s, s', .... 
The grand-world consequences F may as well be taken to be a bounded 
set of real numbers. The grand-world acts are then real-valued func- 
tions f, g, h, .... The preference ordering between acts is determined 
by the condition that f < g if and only if 

(1)    E(f - g) \le 0, 


where the expected value indicated in (1) is derived from a probability 
measure P characteristic of the grand world or, to be more exact, of 
the person’s attitude toward the grand world. 

† Technical note: It is mathematically more general and elegant not to insist that 
the small world have states at all, but rather to speak of a special class of events as 
small-world events. This class should be closed under complements and finite unions. 
In short, the small-world events, and thereby the small world itself, constitute a 
Boolean subalgebra of the Boolean algebra of the grand-world events. 

The construction of a small world \bar{S} from the grand world S begins 
with the partition of S into subsets, or small-world states \bar{s}, \bar{s}', ... (not 
necessarily finite in number). Throughout this technical discussion, it 
will be necessary to bear in mind certain double interpretations such 
as that \bar{s} is both an element of \bar{S} and a subset of S. Strictly speaking, a 
small-world event \bar{B} in \bar{S} is a collection of subsets of S and not itself a 
subset of S. However, the union of all the elements of \bar{B}, regarded as 
subsets of S, is an event in S; call it [\bar{B}]. 

The small world, as I mean to define it, is determined not only by 
the definition of a state, but also by the definition of small-world con- 
sequences. A small-world consequence is a grand-world act. A set \bar{F} of 
grand-world acts, regarded as small-world consequences, is thus part of 
the definition of any given small world. It will be mathematically 
simplest, and cost little if anything in insight, to suppose that the ele- 
ments of \bar{F} are finite in number. They will be denoted \bar{f}, \bar{g}, \bar{h}, ...; 
and, when the small-world consequence \bar{f} is recognized as a grand-world 
act, \bar{f}(s) will denote the grand-world consequence of \bar{f} at the grand- 
world state s. 

A small-world act \bar{\mathbf{f}} is, of course, a function from small-world states \bar{s} 
to small-world consequences \bar{f}. In this isolated technical discussion, we 
will hobble along with the notations \bar{\mathbf{f}}(\bar{s}) for the small-world conse- 
quence attached to \bar{s} by \bar{\mathbf{f}}, and \bar{\mathbf{f}}(s; \bar{s}) for the grand-world consequence 
attached to s by \bar{\mathbf{f}}(\bar{s}) recognized as a grand-world act. Each small- 
world act \bar{\mathbf{f}} gives rise to a unique grand-world act \hat{f}, defined thus: 


(2)    \hat{f}(s) =_{Df} \bar{\mathbf{f}}(s; \bar{s}(s)), 


where \bar{s}(s) means that small-world state of which the grand-world 
state s is an element. 

The distinction between \bar{\mathbf{f}} and \hat{f}, like some other distinctions I have 
thought it worth while to make in the present complicated context, is 
perhaps pedantic. At any rate, it is to be understood as part of the 
definition of a small world that \bar{\mathbf{f}} < \bar{\mathbf{g}} if and only if \hat{f} < \hat{g}, that is, in 
view of (1), if and only if E(\hat{f}) \le E(\hat{g}). In this connection, it is useful 
to note that 


(3)    E(\hat{f}) = \sum_{\bar{k} \in \bar{F}} E(\hat{f} | \bar{\mathbf{f}}(\bar{s}) = \bar{k}) P(\bar{\mathbf{f}}(\bar{s}) = \bar{k}) 
               = \sum_{\bar{k} \in \bar{F}} E(\bar{k} | \bar{\mathbf{f}}(\bar{s}) = \bar{k}) P(\bar{\mathbf{f}}(\bar{s}) = \bar{k}). 
z 


It may be advantageous to review (3), and thereby the whole techni- 
cal definition of a small world, in terms of an example. A small-world 
act, typified by the purchase of a lottery ticket, amounts to accepting 
the consequences of one of several ordinary grand-world acts according 
to which element of a partition does in fact obtain. For example, the 
participant in a lottery may drive away a car, lead away a goat, face 
a firing squad, or remain in the status quo, according to the terms of 
the lottery and according to which ticket he has in fact drawn. Letting 
the example of the lottery stand for the general situation, the expected 
utility of a lottery ticket can be computed by the partition formula 
(3.5.3) from the conditional expectation associated with each ticket, 
which is what (3) does. 
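
The lottery reading of (3) can be checked on a toy grand world. The sketch below is purely illustrative: the states (a ticket letter paired with the weather), the prizes and their utilities, and every function name are hypothetical choices, with the car worth more in fair weather so that the prize is a genuinely uncertain grand-world act.

```python
from fractions import Fraction

# Hypothetical grand world: (ticket drawn, weather) pairs, all equally likely.
P = {(t, w): Fraction(1, 6) for t in "ABC" for w in ("fair", "foul")}

# Small-world consequences are grand-world acts: utility as a function of s.
PRIZES = {
    "car": lambda s: Fraction(10) if s[1] == "fair" else Fraction(6),
    "goat": lambda s: Fraction(2),
    "nothing": lambda s: Fraction(0),
}


def small_state(s):
    """The small-world state (here, the ticket drawn) containing grand-world state s."""
    return s[0]


def induced_act(small_act):
    """Grand-world act induced by a small-world act, in the manner of (2)."""
    return lambda s: PRIZES[small_act[small_state(s)]](s)


def expectation(act, event):
    """Conditional expectation of a grand-world act given an event (a set of states)."""
    p_event = sum(P[s] for s in event)
    return sum(act(s) * P[s] for s in event) / p_event


def partition_formula(small_act):
    """Right-hand side of (3): sum over prizes k of E(k | prize is k) P(prize is k)."""
    total = Fraction(0)
    for name, prize in PRIZES.items():
        event = {s for s in P if small_act[small_state(s)] == name}
        if event:
            total += expectation(prize, event) * sum(P[s] for s in event)
    return total


ticket = {"A": "car", "B": "goat", "C": "nothing"}
direct = expectation(induced_act(ticket), set(P))
print(direct, partition_formula(ticket))
```

The two printed numbers agree, the first being computed state by state and the second prize by prize.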

It may fairly be said that a lottery prize is not an act, but rather the 
opportunity to choose from a number of acts. Thus a cash prize puts 
its possessor in a position to choose among many purchases he could 
not otherwise afford. I believe that analysis to be more nearly correct, 
but it is more complicated; and, if one thinks of each set of acts made 
available by a lottery prize as represented by a best act of that set, 
the more complicated analysis seems superfluous, at least in a first 
attack. 

A small world is completely satisfactory for the use to which I mean 
to put it, if and only if it itself satisfies the seven postulates and leads 
to (more technically, agrees with) a probability \bar{P} such that 


(4)    \bar{P}(\bar{B}) = P([\bar{B}]) 

for all \bar{B} \subset \bar{S} and has a utility \bar{U} such that 

(5)    \bar{U}(\bar{f}) = E(\bar{f}) 


for all \bar{f} \in \bar{F}. For the present context, call such a completely satisfac- 
tory small world a microcosm; if the small world satisfies the postulates, 
but does not necessarily admit \bar{P} as its probability nor \bar{U} as a utility, 
call it a pseudo-microcosm. 

To display the circumstances under which a small world is a pseudo- 
microcosm, I shall briefly comment on each of the postulates in the 
form given on the end papers of this book, referring to them here as 
\bar{P}1-7, as opposed to P1-7, to emphasize that they are here being con- 
sidered with respect to \bar{S} and \bar{F}. 

\bar{P}1 Simple ordering. 

Automatically satisfied. Indeed it is directly implied by P1. 

\bar{P}2 Conditional preference well defined. 

Automatic. 

\bar{P}3 Conditional preference does not affect consequences. 

Requires exactly that, for every \bar{f}, \bar{g} \in \bar{F}, and \bar{B} \subset \bar{S}, either: 
a. \bar{f} < \bar{g} given [\bar{B}], if and only if \bar{f} < \bar{g}, or 
b. \bar{h} < \bar{k} given [\bar{B}], for every \bar{h}, \bar{k} \in \bar{F}. 
In these inequalities the elements of \bar{F} are of course interpreted as 
grand-world acts. 

\bar{P}4 Qualitative personal probability well defined. 

Requires exactly that, if \bar{f} < \bar{g} and h_B < h_C, where 

(6)    h_B(s) = \begin{cases} \bar{g}(s) & \text{for } s \in [\bar{B}] \\ \bar{f}(s) & \text{for } s \in \sim[\bar{B}] \end{cases} 
       h_C(s) = \begin{cases} \bar{g}(s) & \text{for } s \in [\bar{C}] \\ \bar{f}(s) & \text{for } s \in \sim[\bar{C}]; \end{cases} 

then h'_B < h'_C, where h'_B and h'_C are defined in terms of \bar{f}' < \bar{g}', 
in analogy with (6). 

This postulate is automatic in case \bar{F} has at most two elements. 

\bar{P}5 The person has some definite preference. 

Requires \bar{f} < \bar{g} for some \bar{f}, \bar{g} \in \bar{F}. 

\bar{P}6 Partition of worlds into tiny events. 

It is clear that this postulate is not automatic, that is, it is not im- 
plied by the validity of P1-7 for the grand world. It is not even im- 
plied by P1-7 together with \bar{P}1-5, though in the presence of all these 
\bar{P}6 can undoubtedly be weakened. There seems to be little to gain 
by reducing \bar{P}6 to such minimal terms, nor by 
expressing it, as \bar{P}1-5 have been expressed, in grand-world terms alone; 
for \bar{P}6 does not lend itself easily to such treatment, though it would be 
easy to decide in any instance whether \bar{P}6 obtained without undue 
reference to the grand world. 

\bar{P}7 Strong form of sure-thing principle. 

Automatic, in view of the explicit assumption that \bar{F} has only a 
finite number of elements. 


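The problem of deciding whether a pseudo-microcosm is a microcosm is 
much facilitated by the fact that the probability measure and a utility 
of a pseudo-microcosm can be written down explicitly, as the next few 
paragraphs show. 

To study the problem, suppose the small world is a pseudo-micro- 
cosm. Then, in view of \bar{P}5, let \bar{g}, \bar{h} be elements of \bar{F} such that 
\bar{g} < \bar{h}, and let 

(7)    Q(\bar{B}) =_{Df} \frac{E(\bar{h} - \bar{g} | [\bar{B}])}{E(\bar{h} - \bar{g})} P([\bar{B}]) 
                  = E^{-1}(\bar{h} - \bar{g}) \int_{[\bar{B}]} \{ \bar{h}(s) - \bar{g}(s) \} \, dP(s). 

By using \bar{P}3 to check that Q is non-negative, it is easily verified that Q 
is a probability measure. That the probability measure Q agrees with the 
relation < between small-world events is easily verified on computing the 
expected value of the act \bar{\mathbf{f}}_B that takes the value \bar{h} on \bar{B} and \bar{g} on 
\sim\bar{B}, thus: 

(8)    E(\hat{f}_B) = E(\bar{h} | [\bar{B}]) P([\bar{B}]) + E(\bar{g} | \sim[\bar{B}]) P(\sim[\bar{B}]) 
                    = E(\bar{h} - \bar{g} | [\bar{B}]) P([\bar{B}]) + E(\bar{g}) 
                    = E(\bar{h} - \bar{g}) Q(\bar{B}) + E(\bar{g}). 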

Since \bar{g} and \bar{h} are essentially arbitrary, there are many ways to con- 
struct a probability measure that agrees with the relation < between 
small-world events, but, in the presence of \bar{P}1-6, all of them must (in 
view of Corollary 3.3.1) be the same as Q. That consideration leads to 
the formula 


(9)    E(\bar{f} - \bar{f}' | [\bar{B}]) P([\bar{B}]) = E(\bar{f} - \bar{f}') Q(\bar{B}) 

for all \bar{f}, \bar{f}' \in \bar{F} and \bar{B} \subset \bar{S}. 

Using (9) and recalling that \bar{U}(\bar{f}) has been defined as E(\bar{f}), (3) can 
be rewritten thus: 


(10)    E(\hat{f}) = E(\bar{g}) + \sum_{\bar{k}} E(\bar{k} - \bar{g} | \bar{\mathbf{f}}(\bar{s}) = \bar{k}) P(\bar{\mathbf{f}}(\bar{s}) = \bar{k}) 
                = \sum_{\bar{k}} \bar{U}(\bar{k}) Q(\bar{\mathbf{f}}(\bar{s}) = \bar{k}). 
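
The passage from (3) to (10) can be spelled out step by step. The following is a sketch of the intermediate algebra, using only (3), (9), the definition of \bar{U} as an expected value, and the fact that the small-world events on which the act takes its several values form a partition, so that their Q-measures sum to 1. The shorthand A_k for the small-world event on which the act takes the value \bar{k} is introduced here for brevity.

```latex
\begin{align*}
E(\hat{f})
  &= \sum_{\bar{k}} E(\bar{k} \mid [A_k])\, P([A_k])
     && \text{by (3)} \\
  &= \sum_{\bar{k}} E(\bar{g} \mid [A_k])\, P([A_k])
     + \sum_{\bar{k}} E(\bar{k} - \bar{g} \mid [A_k])\, P([A_k]) \\
  &= E(\bar{g}) + \sum_{\bar{k}} E(\bar{k} - \bar{g})\, Q(A_k)
     && \text{by (9)} \\
  &= \bar{U}(\bar{g}) + \sum_{\bar{k}} \bigl( \bar{U}(\bar{k}) - \bar{U}(\bar{g}) \bigr) Q(A_k) \\
  &= \sum_{\bar{k}} \bar{U}(\bar{k})\, Q(A_k)
     && \text{since } \sum_{\bar{k}} Q(A_k) = 1 .
\end{align*}
```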


The question whether a given pseudo-microcosm is really a micro- 
cosm is the question whether Q(\bar{B}) = P([\bar{B}]) and whether \bar{U} is a utility 
for the pseudo-microcosm. The answer to the second part is immediate 
and, I think, somewhat surprising, for (10) shows that for any pseudo- 
microcosm \bar{U} is indeed a utility. 

Unfortunately, the condition Q(\bar{B}) = P([\bar{B}]) is not also automatic. 
The possibility of its failing to be satisfied is illustrated by the following 
simple mathematical example. Let S be the unit square 0 ≤ x, y ≤ 1, 
and let 


(11)    E(f) = \int_0^1 \int_0^1 f(x, y) \, dx \, dy. 


It is of no real moment that the integral in (11), if understood in the 
Lebesgue or Riemann sense, is not defined for all bounded functions. 
Let the elements of \bar{S} be the vertical line segments, x = constant. 
Finally, suppose that the elements of \bar{F} consist of the function zero and 
any finite number of non-negative multiples of a fixed positive function 
h = \bar{h}. It is easy to verify that \bar{S} as thus defined is a pseudo-microcosm 
and that 

(12)    Q(\bar{B}) = \int_{\bar{B}} q(x') \, dx', 

where 

(13)    q(x') = \frac{\int_0^1 h(x', y) \, dy}{\int_0^1 \int_0^1 h(x, y) \, dx \, dy}. 

Unless q is 1 for every x', which will not at all typically be the case, \bar{S} 
is not really a microcosm. 
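
The unit-square example can be made concrete numerically. The sketch below evaluates (12) and (13) by the midpoint rule for two positive functions h chosen here purely for illustration; the function names are likewise hypothetical. The small-world event considered is the set of vertical segments with x between a and b, whose grand-world probability is b - a.

```python
def midpoint(fn, a, b, n=200):
    """Midpoint-rule approximation to the integral of fn over [a, b]."""
    step = (b - a) / n
    return step * sum(fn(a + (k + 0.5) * step) for k in range(n))


def make_q(h):
    """The density q of (13) for a positive function h on the unit square."""
    total = midpoint(lambda x: midpoint(lambda y: h(x, y), 0.0, 1.0), 0.0, 1.0)
    return lambda x: midpoint(lambda y: h(x, y), 0.0, 1.0) / total


def Q_interval(h, a, b):
    """Q of the small-world event 'x lies between a and b', as in (12)."""
    return midpoint(make_q(h), a, b)


# h varying with x: Q differs from P, so this small world is only a pseudo-microcosm.
print(Q_interval(lambda x, y: 1.0 + x, 0.0, 0.5))
# h varying with y alone: q is identically 1, and Q agrees with P.
print(Q_interval(lambda x, y: 1.0 + y, 0.0, 0.5))
```

For h(x, y) = 1 + x the exact value of Q for the left half of the square is 5/12 rather than 1/2, whereas a function h that does not vary with x gives exactly 1/2.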

The general condition that a pseudo-microcosm be a microcosm (i.e., 
that Q(\bar{B}) = P([\bar{B}])) is evidently, in view of (9), 


(14)    E(\bar{f} - \bar{f}' | [\bar{B}]) = E(\bar{f} - \bar{f}') 


for every \bar{f}, \bar{f}' \in \bar{F} and every \bar{B} for which P([\bar{B}]) > 0. Incidentally, 
that condition alone practically implies that a small world \bar{S}, not neces- 
sarily assumed to be a pseudo-microcosm, is a real microcosm. More 
exactly, it implies all the postulates \bar{P}1-7, except \bar{P}6; and it implies 
that the probability measure \bar{P} agrees with the relation < between 
small-world events. Also, if a small world is a pseudo-microcosm, it is 
enough that (14) should hold for some pair of functions for which the 
right-hand side of the equation does not vanish. 

Equation (14) is, however, unsatisfactory in that it seems incapable 
of verification without taking the grand world much too seriously. 
Some consolation may derive from the fact that if \bar{f} and \bar{f}' are constants 
they automatically satisfy (14). Two such absolute, or grand-world, 
consequences would suffice, for, as has just been remarked, it is suffi- 
cient that (14) be satisfied for two materially different small-world 
consequences, in the presence of \bar{P}1-7 (which are verifiable without 
any detailed knowledge of the grand world). It must, however, be ad- 
mitted, as has already been mentioned, that the very idea of a grand- 
world consequence takes the grand world pretty seriously, a point 
forced into my reluctant mind by a conversation with Francesco Bram- 
billa. 

I feel, if I may be allowed to say so, that the possibility of being taken 
in by a pseudo-microcosm that is not a real microcosm is remote, but 
the difficulty I find in defining an operationally applicable criterion is, 
to say the least, ground for caution. 

There certainly seem to be cases in which one could confidently as- 
sume (14), though thus far formal analysis of the source of such se- 
curity escapes me. Consider, for example, a lottery in which numbered 
tickets are drawn from a drum. It seems clear that for an ordinary 
person the outcome of the lottery is utterly irrelevant to his life, except 
through the rules of the lottery itself. In other terms equally loose, 
the value of a thousand dollars, or of a car, to a person would not ordi- 
narily depend at all on what numbers were drawn in a lottery, unless 
the person himself (or perhaps some other person or organization with 
whom he had some degree of contact) held tickets in the lottery. A 
more precise formulation, which does indeed imply (14), is that the 
events that represent the outcome of the lottery are all statistically 
independent of the grand-world acts, or functions, that typically enter 
as prizes in a lottery. This suggests once more that it would be desir- 
able, if possible, to find a simple qualitative personal description of in- 
dependence between events. (Compare the first paragraph after 
(3.5.2).) 


6 Historical and critical comments on utility 


A casual historical sketch of the concept of utility will perhaps have 
some interest as history. At any rate, most of the critical ideas per- 
taining to utility that I wish to discuss find their places in such a sketch 
as conveniently as in any other organization I can devise. Much more 
detailed material on the history of utility, especially in so far as the 
economics of risk bearing is concerned, is to be found in Arrow’s review 
article [A6]. Stigler’s historical study [S18] emphasizes the history of 
the now almost obsolete economic notion of utility in riskless situations, 
a notion still sometimes confused with the one under discussion. 

As was mentioned in § 4.5, the earliest mathematical studies of prob- 
ability were largely concerned with gambling, particularly with the 
question of which of several available cash gambles is most advanta- 
geous. Early probabilists advanced the maxim that the gamble with 
the highest expected winnings is best or, in terms of utility, that wealth 
measured in cash is a utility function. Some sense can be seen in that 
maxim, which will here be called by its traditional though misleading 
name, the principle of mathematical expectation. First, it has often been 
argued that the principle follows for the long run from the weak law of 
large numbers, applied to large numbers of independent bets, in each 
of which only sums that the gambler considers small are to be won or 
lost. Second, Daniel Bernoulli, who, in [B10], was one of the first to 
introduce a general idea of utility corresponding to that developed in 
the preceding three sections, made the following analysis of the princi- 
ple, which justifies its application in limited but important contexts. 
If the consequences f to be considered are all quantities of cash, it is 
reasonable to suppose that U(f) will change smoothly with changes in 
f. Therefore, if a person’s present wealth is $f_0$, and he contemplates
various gambles, none of which can greatly change his wealth, the
utility function can, for his particular purpose, be approximated by its
tangent at $f_0$, that is,

(1)    $U(f) \approx U(f_0) + (f - f_0)\,U'(f_0)$,


a linear function of f. Since a constant term is irrelevant to any com-
parison of expected values, the approximation amounts to regarding
utility as proportional to wealth, that is, to following the principle of
mathematical expectation. So far as I know, the only other argument
for the principle that has ever been advanced is one concerning equity 
between two players. As Bernoulli says, that argument is irrelevant at 
best; and neither of the relevant arguments justifies categorical accept-
ance of the principle. None the less, the principle was at first so cate-
gorically accepted that it seemed paradoxical to mathematicians of the
early eighteenth century that presumably prudent individuals reject 
the principle in certain real and hypothetical decision situations. 
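
A small numerical sketch may make the tangent argument concrete. It assumes, purely for illustration, a logarithmic utility, a present wealth of 100,000, and two invented small gambles; none of these figures comes from the text, and the helper names `expected` and `tangent_utility` are hypothetical.

```python
import math

def expected(lottery, fn=lambda x: x):
    """Expected value of fn(outcome) over (outcome, probability) pairs."""
    return sum(p * fn(x) for x, p in lottery)

def tangent_utility(u, du, f0):
    """Tangent-line approximation of the utility u at wealth f0."""
    return lambda f: u(f0) + (f - f0) * du(f0)

# Assumed setup (not from the text): log utility, present wealth 100,000.
f0 = 100_000.0
u = math.log
du = lambda f: 1.0 / f          # derivative of the assumed utility
lin = tangent_utility(u, du, f0)

# Two small gambles, written as final wealth levels with probabilities.
gamble_a = [(f0 + 10, 0.5), (f0 - 8, 0.5)]   # expected gain of 1
gamble_b = [(f0 + 3, 0.5), (f0 - 3, 0.5)]    # expected gain of 0

exact_gap = expected(gamble_a, u) - expected(gamble_b, u)
linear_gap = expected(gamble_a, lin) - expected(gamble_b, lin)
cash_gap = expected(gamble_a) - expected(gamble_b)

print(exact_gap > 0, linear_gap > 0, cash_gap > 0)
```

Under these assumptions the exact utility and its tangent agree on which gamble is better, and the linearized gap is simply the cash gap multiplied by the slope at the present wealth. That is the sense in which, for small stakes, the approximation reduces to comparing expected winnings.
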
Daniel Bernoulli (1700-1782), in the paper [B10], seems to have 
been the first to point out that the principle is at best a rule of thumb, 
and he there suggested the maximization of expected utility as a more 
valid principle. Daniel Bernoulli’s paper reproduces portions of a let-
ter from Gabriel Cramer to Nicholas Bernoulli, which establishes
Cramer’s chronological priority to the idea of utility and most of the 
other main ideas of Bernoulli’s paper. But it is Bernoulli’s formulation 
together with some of the ideas that were specifically his that became 
popular and have had widespread influence to the present day. It is 
therefore appropriate to review Bernoulli’s paper in some detail. 
Being unable to read Latin, I follow the German edition [B11]. 
Bernoulli begins by reminding his readers that the principle of mathe- 
matical expectation, though but weakly supported, had theretofore 
dominated the theory of behavior in the face of uncertainty. He says 
that, though many arguments had been given for the principle, they 
were all based on the irrelevant idea of equity among players. It seems 
hard to believe that he had never heard the argument justifying the 
principle for the long run, even though the weak law of large numbers 
was then only in its mathematical infancy. Ars Conjectandi [B12], then 
a fairly up-to-date and most eminent treatise on probability, does seem 
to give only the argument about equity, and that in countless forms. 
This treatise by Daniel’s uncle, Jacob (= James) Bernoulli (1654-1705), 
incidentally, contains the first mathematical advance toward the weak 
law, proving it for the special case of repeated trials. 
Many examples show that the principle of mathematical expecta-
tion is not universally applicable. Daniel Bernoulli promptly presents
one: “To justify these remarks, let us suppose a pauper happens to ac-
quire a lottery ticket by which he may with equal probability win
either nothing or 20,000 ducats. Will he have to evaluate the worth
of the ticket as 10,000 ducats; and would he be acting foolishly, if he
sold it for 9,000 ducats?”
Other examples occur later in the paper as illustrations of the use
of the utility concept. Thus a prudent merchant may insure his ship
against loss at sea, though he understands perfectly well that he is
thereby increasing the insurance company’s expected wealth, and to
the same extent decreasing his own. Such behavior is in flagrant vio- 
lation of the principle of mathematical expectation, and to one who held 
that principle categorically it would be as absurd to insure as to throw 
money away outright. But the principle is neither obvious nor de- 
duced from other principles regarded as obvious; so it may be challenged, 
and must be, because everyone agrees that it is not really insane to 
insure. 

Bernoulli cites a third, now very famous, example illustrating that 
men of prudence do not invariably obey the principle of mathematical 
expectation. This example, known as the St. Petersburg paradox (be- 
cause of the journal in which Bernoulli’s paper was published) had ear- 
lier been publicized by Nicholas Bernoulli, and Daniel acknowledges 
it as the stimulus that led to his investigation of utility. Suppose, to 
state the St. Petersburg paradox succinctly, that a person could choose 
between an act leaving his wealth fixed at its present magnitude or one 
that would change his wealth at random, increasing it by $(2^n - f)$ dol-
lars with probability $2^{-n}$ for every positive integer n. No matter how
large the admission fee f may be, the expected income of the random 
act is infinite, as may easily be verified. Therefore, according to the 
principle of mathematical expectation, the random act is to be pre- 
ferred to the status quo. Numerical examples, however, soon convince 
any sincere person that he would prefer the status quo if f is at all 
large. If f is $128, for example, there is only 1 chance in 64 that a 
person choosing the random act will so much as break even, and he
will otherwise lose at least $64, a jeopardy for which he can seek com- 
pensation only in the prodigiously improbable winning of a prodigiously 
high prize. 
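
The figures quoted for an admission fee of $128 can be checked with a short sketch. The function names are hypothetical, and the code assumes the game exactly as stated above: a prize of 2^n dollars with probability 2^-n, and a fee larger than $2.

```python
from fractions import Fraction

def prob_break_even(fee):
    """Probability that the prize 2**n is at least the fee."""
    n = 1
    while 2 ** n < fee:
        n += 1
    # P(N >= n) is the geometric tail sum of 2**-k for k >= n, i.e. 2**-(n-1).
    return Fraction(1, 2 ** (n - 1))

def smallest_loss_when_losing(fee):
    """Fee minus the largest prize that is still below the fee (assumes fee > 2)."""
    n = 1
    while 2 ** (n + 1) < fee:
        n += 1
    return fee - 2 ** n

def truncated_expected_prize(max_n):
    """Expected prize counting only outcomes n = 1..max_n; each term equals 1."""
    return sum(Fraction(1, 2 ** n) * 2 ** n for n in range(1, max_n + 1))

print(prob_break_even(128))            # 1/64
print(smallest_loss_when_losing(128))  # 64
print(truncated_expected_prize(20))    # 20
```

Because every outcome contributes exactly one dollar to the expected prize, the truncated expectation grows without bound as more outcomes are included, which is why no finite fee can offset it under the principle of mathematical expectation.
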

Appealing to intuition, Bernoulli says that the cash value of a per-
son’s wealth is not its true, or moral, worth to him. Thus, according to
Bernoulli, the dollar that might be precious to a pauper would be nearly
worthless to a millionaire, or, better, to the pauper himself were he to
become a millionaire. Bernoulli then postulates that people do seek
to maximize the expected value of moral worth, or what has been called
moral expectation.

Operationally, the moral worth of a person's wealth, so far as it con-
cerns behavior in the face of uncertainty, is just what I would call the
utility of the wealth, and moral expectation is expectation of utility. 

† Daniel refers to this Nicholas Bernoulli as his uncle, but, in view of dates men-
tioned in the last section of Daniel’s paper and the genealogy in Chapter 8 of [B9],
I think he must have meant his elder cousin (1687-1759), perhaps using “uncle” as
a term of deference.

It seems mystical, however, to talk about moral worth apart from
probability and, having done so, doubly mystical to postulate that this 
undefined quantity serves as a utility. These obvious criticisms have 
naturally led many to discredit the very idea of utility, but §§ 2-4
show (following von Neumann and Morgenstern) that there is a more 
cogent, though not altogether unobjectionable, path to that concept. 
Bernoulli argued, elaborating the example of the pauper and the 
millionaire, that a fixed increment of cash wealth typically results in 
an ever smaller increment of moral wealth as the basic cash wealth to 
which the increment applies is increased. He admitted the possibility 
of examples in which this law of diminishing marginal utility, as it has 
come to be called in the literature of economics, might fail. For ex- 
ample, a relatively small sum might be precious to a wealthy prisoner 
who required it to complete his ransom. But Bernoulli insisted that 
such examples are unusual and that as a general rule the law may be 
assumed. In mathematical terms, the law says that utility as a func-
tion of money is a concave (i.e., the negative of a convex) function.†
It follows from the basic inequality concerning convex functions (Theo- 
rem 1 of Appendix 2) that a person to whom the law of diminishing 
marginal utility applies will always prefer the status quo to any fair 
gamble, that is, to any random act for which the change in his expected 
wealth is zero, and that he will always be willing to pay something in 
addition to its actuarial, or expected, value for insurance against any 
loss to himself. The law of diminishing marginal utility has been very 
popular, and few who have considered utility since Bernoulli have dis- 
carded it, or even realized that it was not necessarily part and parcel 
of the utility idea. Of course, the law has been embraced eagerly and 
uncritically by those who have a moral aversion to gambling. 
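
The two consequences of concavity just stated can be illustrated numerically. The sketch below assumes a square-root utility and invented wealth figures, neither of which appears in the text; `expected_utility` and `certainty_equivalent` are hypothetical helper names.

```python
import math

def expected_utility(lottery, u):
    """Expected utility of (wealth, probability) pairs."""
    return sum(p * u(w) for w, p in lottery)

def certainty_equivalent(lottery, u, u_inverse):
    """Sure wealth whose utility equals the lottery's expected utility."""
    return u_inverse(expected_utility(lottery, u))

u = math.sqrt                 # an assumed concave utility
u_inverse = lambda x: x * x

wealth = 10_000.0

# A fair gamble: expected change in wealth is zero.
fair_gamble = [(wealth + 3_600, 0.5), (wealth - 3_600, 0.5)]
prefers_status_quo = u(wealth) > expected_utility(fair_gamble, u)

# An uninsured risk: lose 3,600 with probability 0.1.
uninsured = [(wealth, 0.9), (wealth - 3_600, 0.1)]
expected_loss = 0.1 * 3_600
max_premium = wealth - certainty_equivalent(uninsured, u, u_inverse)

print(prefers_status_quo)          # True
print(expected_loss, max_premium)  # the premium exceeds the expected loss
```

With these assumed numbers the person would pay up to about 396 for full insurance against an expected loss of 360, so he willingly pays something above the actuarial value, as the inequality for concave functions requires.
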
Bernoulli went further than the law of diminishing marginal utility 
and suggested that the slope of utility as a function of wealth might, 
at least as a rule of thumb, be supposed, not only to decrease with, but 
to be inversely proportional to, the cash value of wealth. This, he 
pointed out, is equivalent to postulating that utility is equal to the 
logarithm (to any base) of the cash value of wealth. To this day, no 
other function has been suggested as a better prototype for Everyman’s 
utility function. None the less, as Cramer pointed out in his aforemen- 
tioned letter, the logarithm has a serious disadvantage; for, if the loga- 
rithm were the utility of wealth, the St. Petersburg paradox could be 


† Often the meanings of “convex” and “concave” as applied to functions are in-
terchanged. A function is here called convex if it appears convex, in the ordinary
sense of the word, when viewed from below. Such a function is, of course, also con-
cave from above, whence the confusion. Cf. Appendix 2.

amended to produce a random act with an infinite expected utility
(i.e., an infinite expected logarithm of income) that, again, no one would
really prefer to the status quo. To take a less elaborate example, sup-
pose that a man’s total wealth, including an appraisal of his future
pose that a man’s total wealth, including an appraisal of his future 
earning power, were a million dollars. If the logarithm of wealth were 
actually his utility, he would as soon as not flip a coin to decide whether 
his wealth should be changed to ten thousand dollars—roughly $500 
per year—or a hundred million dollars. This seems preposterous to 
me. At any rate, I am sure you can construct an example along the 
same lines that will seem preposterous to you. Cramer therefore con- 
cluded, and I think rightly, that the utility of cash must be bounded, 
at least from above. It seems to me that a good argument can also be 
adduced for supposing utility to be bounded from below, for, however 
wealth may be interpreted, we all subject our total wealth to slight 
jeopardy daily for the sake of a large probability of avoiding more 
moderate losses. But the logarithm is unbounded both from above 
and from below; so, though it might be a reasonable approximation to 
a person’s utility in a moderate range of wealth, it cannot be taken 
seriously over extreme ranges. 
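
Both objections to the logarithm can be checked by direct computation. The coin-flip figures below are those of the example above; the amended game, which pays a total wealth of 10^(2^n) with probability 2^-n, is only one assumed way of realizing the amendment and is not spelled out in the text.

```python
import math

u = math.log10  # base 10 for readability; any base gives the same comparisons

# The coin flip: wealth of one million versus an even chance of
# ten thousand or a hundred million.
status_quo = u(1_000_000)
coin_flip = 0.5 * u(10_000) + 0.5 * u(100_000_000)

def amended_partial_sum(max_n):
    """Expected log-wealth from the first max_n outcomes of the assumed
    amended game paying wealth 10**(2**n) with probability 2**-n."""
    return sum(2.0 ** -n * u(10 ** (2 ** n)) for n in range(1, max_n + 1))

print(status_quo, coin_flip)    # equal: the person is indifferent
print(amended_partial_sum(8))   # each outcome adds 1, so the sum diverges
```

The indifference arises because the logarithm of a million lies exactly midway between the logarithms of ten thousand and of a hundred million, and the divergence arises because prizes growing doubly exponentially cancel the damping effect of the logarithm.
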

Bernoulli’s ideas were accepted wholeheartedly by Laplace [L1], who 
was very enthusiastic about the applications of probability to all sorts 
of decision problems. It is my casual impression, however, that from 
the time of Laplace until quite recently the idea of utility did not 
strongly influence either mathematical or practical probabilists. 

For a long period economists accepted Bernoulli’s idea of moral 
wealth as the measurement of a person’s well-being apart from any 
consideration of probability. Though “utility” rather than “moral 
worth” has been the popular name for this concept among English-
speaking economists, it is my impression that Bernoulli’s paper is the
principal, if not the sole, source of the notion for all economists, though
the paper itself may often have been lost sight of. Economists were for
a time enthusiastic about the principle of diminishing marginal utility,
and they saw what they believed to be reflections of it in many aspects 
of everyday life. Why else, to paraphrase Alfred Marshall (pp. 19, 
95 of [M2]), does a poor man walk in a rain that induces a rich man to 
take a cab? 


During the period when the probability-less idea of utility was popu-
lar with economists, they referred not only to the utility of money,
but also to the utility of other consequences such as commodities (and
services) and combinations (or, better, patterns of consumption) of com-
modities. The theory of choice among consequences was expressed by
the idea that, among the available consequences, a person prefers those
that have the highest utility for him. Also, the idea of diminishing
marginal utility was extended from money to other commodities. 

The probability-less idea of utility in economics has been completely 
discredited in the eyes of almost all economists, the following argument 
against it—originally advanced by Pareto in pp. 158-159 and the 
Mathematical Appendix of [P1]—being widely accepted. If utility is 
regarded as controlling only consequences, rather than acts, it is not
true (as it is when acts, or at least gambles, are considered and the
formal definition in § 3 is applied) that utility is determined except
for a linear transformation. Indeed, confining attention to conse- 
quences, any strictly monotonically increasing function of one utility 
is another utility. Under these circumstances there is little, if any, 
value in talking about utility at all, unless, of course, special economic 
considerations should render one utility, or say a linear family of utili- 
ties, of particular interest. That possibility remains academic to date, 
though one attempt to exploit it was made by Irving Fisher, as is briefly 
discussed in the paragraph leading to Footnote 155 of [S18]. In par- 
ticular, utility as a function of wealth can have any shape whatsoever 
in the probability-less context, provided only that the function in ques- 
tion is increasing with increasing wealth, the provision following from 
the casual observation that almost nobody throws money away. The 
history of probability-less utility has been thoroughly reported by Stig- 
ler [S18]. 
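
The contrast between the two contexts can be seen in a small sketch. The consequences, the two gambles, and the three utility functions are illustrative assumptions, and the helper names are hypothetical.

```python
import math

def ranking(consequences, u):
    """Order consequences from least to most preferred under u."""
    return sorted(consequences, key=u)

def expected_utility(gamble, u):
    """Expected utility of (consequence, probability) pairs."""
    return sum(p * u(x) for x, p in gamble)

def prefers_risky(sure, risky, u):
    return expected_utility(risky, u) > expected_utility(sure, u)

consequences = [9, 0, 4, 1]          # assumed cash consequences
identity = lambda x: x
monotone = math.sqrt                 # strictly increasing, not linear
linear = lambda x: 2 * x + 3         # increasing linear transformation

sure = [(4, 1.0)]
risky = [(0, 0.5), (9, 0.5)]

# Among consequences alone, all three functions give the same ordering.
print(ranking(consequences, identity) == ranking(consequences, monotone))
# Among gambles, only the linear transformation preserves the preference.
print(prefers_risky(sure, risky, identity),
      prefers_risky(sure, risky, linear),
      prefers_risky(sure, risky, monotone))
```

So long as only consequences are ranked, nothing distinguishes the three functions; once expected values over gambles are compared, the nonlinear transformation reverses the preference while the linear one leaves it intact.
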

What, then, becomes of the intuitive arguments that led to the no- 
tion of diminishing marginal utility? To illustrate, consider the poor 
man and the rich man in the rain. Those of us who consider diminish- 
ing marginal utility nonsensical in this context think it sufficient to 
say simply that it is a common observation that rich men spend money 
freely to avoid moderate physical suffering whereas poor men suffer 
freely rather than make corresponding expenditures of money; in other 
terms, that the rate of exchange between circumstances producing phys- 
ical discomfort and money depends on the wealth of the person involved. 

In recent years there has been revived interest in Bernoulli’s ideas 
of utility in the technical sense of §§ 2-4, that is, as a function that, so
to speak, controls decisions among acts, or at least gambles. Ramsey’s 
essays in [R1], which in spirit closely resemble the first five chapters of 
this book, present a relatively early example of this revival of interest. 
Ramsey improves on Bernoulli in that he defines utility operationally 
in terms of the behavior of a person constrained by certain postulates. 
Ramsey’s essays, though now much appreciated, seem to have had 
relatively little influence. 

Between the time of Ramsey and that of von Neumann and Morgen-
stern there was interest in breaking away from the idea of maximizing
expected utility, at least so far as economic theory was concerned (cf. 
[T1a]). This trend was supported by those who said that Bernoulli gives
no reason for supposing that preferences correspond to the expected 
value of some function, and that therefore much more general possi- 
bilities must be considered. Why should not the range, the variance, 
and the skewness, not to mention countless other features, of the dis- 
tribution of some function join with the expected value in determining 
preference? The question was answered by the construction of Ramsey 
and again by that of von Neumann and Morgenstern, which has been 
slightly extended in §§ 2-4; it is simply a mathematical fact that, al- 
most any theory of probability having been adopted and the sure-thing 
principle having been suitably extended, the existence of a function 
whose expected value controls choices can be deduced. That does not 
mean that as a theory of actual economic behavior the theory of utility 
is absolutely established and cannot be overthrown. Quite the con- 
trary, it is a theory that makes factual predictions many of which can 
easily be observed to be false, but the theory may have some value in 
making economic predictions in certain contexts where the departures 
from it happen not to be devastating. Moreover, as I have been argu-
ing, it may have value as a normative theory.

Von Neumann and Morgenstern initiated among economists and, to
a lesser extent, also among statisticians an intense revival of interest
in the technical utility concept by their treatment of utility, which ap-
pears as a digression in [V4].
The von Neumann-Morgenstern theory of utility has produced this 
reaction, because it gives strong intuitive grounds for accepting the 
Bernoullian utility hypothesis as a consequence of well-accepted maxims
of behavior. To give readers of this book some idea of the von Neu- 
mann-Morgenstern theory, I may repeat that the treatment of utility 
as applied to gambles presented in §3 is virtually copied from their 
book [V4]. Indeed, their ideas on this subject are responsible for almost 
all of my own. One idea now held by me that I think von Neumann 
and Morgenstern do not explicitly support, and that so far as I know 
they might not wish to have attributed to them, is the normative in-
terpretation of the theory.


Of course, much of the new interest in utility takes the form of criti-
cism and controversy. The greater part of this discussion that has come
to my attention has not yet been published. A list of references lead-
ing to most of that which has is [B7], [W14], [S1], [C4], [F13], [A2].

I shall successively discuss each of the recent criticisms of the
modern theory of utility known to me. My method in each case will
be first to state the criticism in a form resembling those in which it is
typically put forward, regardless of whether I consider that form well 
chosen. I will then discuss the criticism, elaborating its meaning and 
indicating its rebuttal, when there seems to me to be one. 


(a) Modern economic theorists have rigorously shown that there is 
no meaningful measure of utility. More specifically, if any function U 
fulfills the role of a utility, then so does any strictly monotonically in- 
creasing function of U. It must, therefore, be an error to conclude that 
every utility is a linear function of every other. 


This argument has been advanced with a seriousness that is surpris- 
ing, considering that it concedes little intelligence or learning to the 
proponents of the utility theory under discussion and considering that 
it results, as will immediately be explained, from the baldest sort of a 
terminological confusion. To be fair, I must go on to say that I have 
never known the argument to be defended long in the presence of the 
explanation I am about to give. 

In ordinary economic usage, especially prior to the work of von Neu- 
mann and Morgenstern, a utility associated with gambles would pre- 
sumably be simply a function U associating numbers with gambles in 
such a way that f < g, if and only if U(f) < U(g); though economic 
discussion of utility was, prior to von Neumann and Morgenstern, al- 
most exclusively confined to consequences rather than to gambles or 
to acts. It is unequivocally true, as I have already brought out, that 
any monotonic function of a utility in this wide classical sense is itself 
a utility. What von Neumann and Morgenstern have shown, and 
what has been recapitulated in § 3, is that, granting certain hypotheses, 
there exists at least one classical utility V satisfying the very special 
condition 


(2)    $V(\alpha f + \beta g) = \alpha V(f) + \beta V(g)$,


where f and g are any gambles and $\alpha$, $\beta$ are non-negative numbers such
that $\alpha + \beta = 1$. Furthermore, if I may for the moment call a classical
utility satisfying (2) a von Neumann-Morgenstern utility, every von 
Neumann-Morgenstern utility is an increasing linear function of every 
other. To put the point differently, the essential conclusion of the von 
Neumann-Morgenstern utility theory is that (2) can be satisfied by a
classical utility, but not by very many. The confusion arises only be- 
cause von Neumann and Morgenstern use the already pre-empted word 
“utility” for what I here call “von Neumann-Morgenstern utility.” 
In retrospect, that seems to have been a mistake in tactics, but one of
no long-range importance.


(b) The postulates leading to the von Neumann-Morgenstern con-
cept of utility are arbitrary and gratuitous.


Such a view can, of course, always be held without the slightest fear 
of rigorous refutation, but a critic holding it might perhaps be persuaded 
away from it by a reformulation of the postulates that he might find 
more appealing than the original set, or by illuminating examples. In 
particular, P1-7 are quite different from, but imply, the postulates of 
von Neumann and Morgenstern. Incidentally, the main function of 
the von Neumann-Morgenstern postulates themselves is to put the es- 
sential content of Daniel Bernoulli's “postulate” into a form that is 
less gratuitous in appearance. At least one serious critic, who had at 
first found the system of von Neumann and Morgenstern gratuitous, 
changed his mind when the possibility of deriving certain aspects of 
that system from the sure-thing principle was pointed out to him. 


(c) The sure-thing principle goes too far. For example, if two lot- 
teries with cash prizes (not necessarily positive) are based on the same 
set of lottery tickets and so arranged that the prize that will be assigned 
to any ticket by the second lottery is at least as great as the prize as- 
signed to that ticket by the first lottery, then there is no doubt that 
virtually any person would find a ticket in the first lottery not prefer- 
able to the same ticket in the second lottery. If, however, the prizes 
in each lottery are themselves lottery tickets, such that the prize asso- 
ciated with any ticket in the first lottery is not preferred by the person 
under study to the prize associated with the same ticket by the second 
lottery, the conclusion that the person will not prefer a ticket in the 
first lottery to the same ticket in the second is no longer compelling. 


This point resembles the preceding one in that the intuitive appeal
of an assumption can at most be indicated, not proved. I do think it
cogent, however, to stress in connection with this particular point that
a cash prize is to a large extent a lottery ticket in that the uncertainty 
as to what will become of a person if he has a gift of a thousand dollars 
is not in principle different from the uncertainty about what will be- 
come of him if he holds a lottery ticket of considerable actuarial value. 

Perhaps an adherent to the criticism in question would think it rele- 
vant to reply thus: Though cash sums are indeed essentially lottery 
tickets, a sum of money is worth at least as much to a person as a smaller 
sum, in a peculiarly definite and objective sense, because money can, 
if one desires, always be quickly and quietly thrown away, thereby 
making any sum available to a person who already has a larger sum. 
But I have never heard that reply made, nor do I here plead its cogency.


(d) An actual systematic deviation from the sure-thing principle and,
with it, from the von Neumann-Morgenstern theory of utility, can be 
exhibited. For example, a person might perfectly reasonably prefer to 
subsist on a packet of Army K rations per meal than on two ounces of 
the best caviar per meal. It is then to be expected, according to the 
sure-thing principle, that the person would prefer the K rations to a
lottery ticket yielding the K rations with probability 9/10 and the 
caviar diet with probability 1/10. That expectation is no doubt ful- 
filled, if the lottery is understood to determine the person’s year-long 
diet once and for all. But, if the person is able to have at each meal a 
lottery ticket offering him the K rations or the caviar with the indicated 
probabilities, it is not at all unlikely, granting that he likes caviar and 
has some storage facilities, that he will prefer this “lottery diet.” This 
conclusion is in defiance of the principle that “the theory of consumer 
demand is a static theory.” (Cf. [W14].) 


I admit that the theory of utility is not static in the indicated sense, 
as the foregoing example conclusively shows. But there is not the 
slightest reason to think of a lottery producing either a steady diet of 
caviar or a steady diet of K rations as being the same lottery as one 
having a multitude of different prizes almost all of which are mixed 
chronological programs of caviar and K rations. The fact that a theory 
of consumer behavior in riskless situations happens to be static in the 
required sense (under certain special assumptions about storability and 
the linearity of prices) is no argument at all that the theory of consumer 
behavior in risky circumstances should be static in the same sense (as 
I mention in a note appended to [W14]). 


(e) If the von Neumann-Morgenstern theory of utility is not static,
it is not subject to repeated empirical observation and is therefore
vacuous. (Cf. [W14].)


I think the discussion in § 3.1 of how to determine the preferences of 
a hot man for a swim, a shower, and a glass of beer, and the discussion 
in §5 of the practicality of identifying pseudo-microcosms are steps 
toward showing how the theory can be put to empirical test without 
making repeated trials on any one person. 


(f) Casual observation shows that real people frequently and fla-
grantly behave in disaccord with the utility theory, and that in fact be-
havior of that sort is not at all typically considered abnormal or ir-
rational.


Two different topics call for discussion under this heading. In the 
first place, it is undoubtedly true that the behavior of people does often
flagrantly depart from the theory. None the less, all the world knows
from the lessons of modern physics that a theory is not to be altogether 
rejected because it is not absolutely true. It seems not unreasonable to 
suppose, and examples could easily be cited to confirm, that in the ex- 
tremely complicated subject of the behavior of people very crude theory 
can play a useful role in certain contexts. 

Second, many apparent exceptions to the theory can be so reinter- 
preted as not to be exceptions at all. For example, a flier may be ob- 
served doing a stunt that risks his life, apparently for nothing. That 
seems to be in complete violation of the theory; but, if in addition it is 
known that the flier has a real and practical need to convince certain 
colleagues of his courage, then he is simply paying for advertising with 
the risk of his life, which is not in itself in contradiction to the theory. 
Or, suppose that it were known more or less objectively that the flier 
has a need to demonstrate his own courage to himself. The theory 
would again be rescued, but this time perhaps not so convincingly as 
before. In general, the reinterpretation needed to reconcile various 
sorts of behavior with the utility theory is sometimes quite acceptable 
and sometimes so strained as to lay whoever proposes it open to the 
charge of trying to save the theory by rendering it tautological. The 
same sort of thing arises in connection with many theories, and I think 
there is general agreement that no hard-and-fast rule can be laid down 
as to when it becomes inappropriate to make the necessary reinterpre- 
tation. For example, the law of the conservation of energy (or its 
atomic age variant, the law of the conservation of mass and energy) 
owes its success largely to its being an expression of remarkable and 
reliable facts of nature, but to some extent also to certain conventions 
by which new sorts of energy are so defined as to keep the law true. 
A stimulating discussion of this delicate point in connection with the 
theory of utility is given by Samuelson in [S1]. 


(g) Introspection about certain hypothetical decision situations sug-
gests that the sure-thing principle and, with it, the theory of utility
are normatively unsatisfactory. Consider an example† based on two de-
cision situations each involving two gambles.


Situation 1. Choose between


Gamble 1. $500,000 with probability 1; and

Gamble 2. $2,500,000 with probability 0.1,
$500,000 with probability 0.89,
status quo with probability 0.01.


† This particular example is due to Allais [A2]. Another interesting example was
presented somewhat earlier by Georges Morlat [C4].


Situation 2. Choose between


Gamble 3. $500,000 with probability 0.11, 
status quo with probability 0.89; and 

Gamble 4. $2,500,000 with probability 0.1, 
status quo with probability 0.9. 


Many people prefer Gamble 1 to Gamble 2, because, speaking quali- 
tatively, they do not find the chance of winning a very large fortune in 
place of receiving a large fortune outright adequate compensation for 
even a small risk of being left in the status quo. Many of the same 
people prefer Gamble 4 to Gamble 3; because, speaking qualitatively, 
the chance of winning is nearly the same in both gambles, so the one 
with the much larger prize seems preferable. But the intuitively ac- 
ceptable pair of preferences, Gamble 1 preferred to Gamble 2 and Gam- 
ble 4 to Gamble 3, is not compatible with the utility concept or, equiva- 
lently, the sure-thing principle. Indeed that pair of preferences implies 
the following inequalities for any hypothetical utility function. 


(3)    U($500,000) > 0.1 U($2,500,000) + 0.89 U($500,000) + 0.01 U($0),
       0.1 U($2,500,000) + 0.9 U($0) > 0.11 U($500,000) + 0.89 U($0);


and these are obviously incompatible. 
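
The incompatibility can be verified mechanically from the probabilities in the definitions of Gambles 1-4. In the sketch below the function name `allais_gaps` and the trial utility values are hypothetical; exact fractions are used so that no rounding enters.

```python
from fractions import Fraction as F

def allais_gaps(u0, u5, u25):
    """Return (gap_12, gap_43): the expected-utility advantage of Gamble 1
    over Gamble 2 and of Gamble 4 over Gamble 3, given the utilities of the
    status quo, $500,000, and $2,500,000."""
    g1 = u5
    g2 = F(10, 100) * u25 + F(89, 100) * u5 + F(1, 100) * u0
    g3 = F(11, 100) * u5 + F(89, 100) * u0
    g4 = F(10, 100) * u25 + F(90, 100) * u0
    return g1 - g2, g4 - g3

# Arbitrary trial utilities (assumed values, increasing in wealth).
trials = [(F(0), F(1), F(2)), (F(0), F(10), F(11)), (F(-3), F(5), F(40))]
for u0, u5, u25 in trials:
    gap_12, gap_43 = allais_gaps(u0, u5, u25)
    # The two gaps are always negatives of each other.
    print(gap_12, gap_43, gap_12 + gap_43 == 0)
```

Subtracting 0.89 U($500,000) from both sides of the first inequality and 0.89 U($0) from both sides of the second leaves the same two quantities compared in opposite directions, which is why the two gaps always sum to zero and cannot both be positive.
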

Examples‡ like the one cited do have a strong intuitive appeal; even
if you do not personally feel a tendency to prefer Gamble 1 to Gamble 2
and simultaneously Gamble 4 to Gamble 3, I think that a few trials
with other prizes and probabilities will provide you with an example
appropriate to yourself.

If, after thorough deliberation, anyone maintains a pair of distinct
preferences that are in conflict with the sure-thing principle, he must
abandon, or modify, the principle; for that kind of discrepancy seems
intolerable in a normative theory. Analogous circumstances forced
D. Bernoulli to abandon the theory of mathematical expectation for
that of utility [B10]. In general, a person who has tentatively accepted
a normative theory must conscientiously study situations in which the
theory seems to lead him astray; he must decide for each by reflection
(deduction will typically be of little relevance) whether to retain his
initial impression of the situation or to accept the implications of the
theory for it.

‡ Allais has announced (but not yet published) an empirical investigation of the
responses of prudent, educated people to such examples [A2].

To illustrate, let me record my own reactions to the example with
which this heading was introduced. When the two situations were
first presented, I immediately expressed preference for Gamble 1 as 
opposed to Gamble 2 and for Gamble 4 as opposed to Gamble 3, and I 
still feel an intuitive attraction to those preferences. But I have since 
accepted the following way of looking at the two situations, which 
amounts to repeated use of the sure-thing principle. 

One way in which Gambles 1-4 could be realized is by a lottery with 
a hundred numbered tickets and with prizes according to the schedule 
shown in Table 1. 


TABLE 1. PRIZES IN UNITS OF $100,000 IN A LOTTERY REALIZING
GAMBLES 1-4

                                 Ticket Number
                               1     2-11    12-100

Situation 1      Gamble 1      5       5        5
                 Gamble 2      0      25        5

Situation 2      Gamble 3      5       5        0
                 Gamble 4      0      25        0


Now, if one of the tickets numbered from 12 through 100 is drawn, it 
will not matter, in either situation, which gamble I choose. I therefore 
focus on the possibility that one of the tickets numbered from 1 through 
11 will be drawn, in which case Situations 1 and 2 are exactly parallel. 
The subsidiary decision depends in both situations on whether I would 
sell an outright gift of $500,000 for a 10-to-1 chance to win $2,500,000— 
a conclusion that I think has a claim to universality, or objectivity. 
Finally, consulting my purely personal taste, I find that I would prefer 
the gift of $500,000 and, accordingly, that I prefer Gamble 1 to Gamble 
2 and (contrary to my initial reaction) Gamble 3 to Gamble 4. 
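
The argument of this paragraph can be sketched numerically. The following is only an illustration, assuming the prizes by ticket block are 5, 5, 5 for Gamble 1; 0, 25, 5 for Gamble 2; 5, 5, 0 for Gamble 3; and 0, 25, 0 for Gamble 4 (in units of $100,000). The two utility functions `linear` and `cautious` are hypothetical, introduced only to show that the advantage of Gamble 1 over Gamble 2 always equals that of Gamble 3 over Gamble 4.

```python
from fractions import Fraction

# Probabilities of the three ticket blocks: ticket 1, tickets 2-11, tickets 12-100.
BLOCK_PROB = [Fraction(1, 100), Fraction(10, 100), Fraction(89, 100)]

# Prizes in units of $100,000 for each gamble, one entry per ticket block.
GAMBLES = {
    1: [5, 5, 5],
    2: [0, 25, 5],
    3: [5, 5, 0],
    4: [0, 25, 0],
}


def expected_utility(gamble, utility):
    """Expected utility of a gamble under a utility function on prizes."""
    return sum(p * utility(prize) for p, prize in zip(BLOCK_PROB, GAMBLES[gamble]))


def advantage(first, second, utility):
    """Expected utility of `first` minus that of `second`."""
    return expected_utility(first, utility) - expected_utility(second, utility)


# Two hypothetical utility functions, chosen only for illustration.
def linear(prize):
    return Fraction(prize)


def cautious(prize):
    # Concave and bounded; under it the sure $500,000 comes out ahead.
    return Fraction(2 * prize, 2 * prize + 1)


for name, u in [("linear", linear), ("cautious", cautious)]:
    d12 = advantage(1, 2, u)
    d34 = advantage(3, 4, u)
    # The two advantages coincide because tickets 12-100 cancel in each situation.
    print(name, d12, d34, d12 == d34)
```

Whatever utility function is substituted, the contribution of tickets 12 through 100 cancels within each situation, so a person who prefers Gamble 1 to Gamble 2 is committed by expected utility to preferring Gamble 3 to Gamble 4.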

It seems to me that in reversing my preference between Gambles 3 
and 4 I have corrected an error. There is, of course, an important sense 
in which preferences, being entirely subjective, cannot be in error; but 
in a different, more subtle sense they can be. Let me illustrate by a 
simple example containing no reference to uncertainty. A man buying 
a car for $2,134.56 is tempted to order it with a radio installed, which 
will bring the total price to $2,228.41, feeling that the difference is 
trifling. But, when he reflects that, if he already had the car, he cer- 
tainly would not spend $93.85 for a radio for it, he realizes that he has 
made an error. 


One thing that should be mentioned before this chapter is closed is
that the law of diminishing marginal utility plays no fundamental role
in the von Neumann-Morgenstern theory of utility, viewed either empirically
or normatively. Therefore the possibility is left open that
utility as a function of wealth may not be concave, at least in some in- 
tervals of wealth. Some economic-theoretical consequences of recog- 
nition of the possibility of non-concave segments of the utility function 
have been worked out by Friedman and myself [F12], and by Friedman
alone [F11]. The work of Friedman and myself on this point is criticized
by Markowitz [M1].


CHAPTER 6 


Observation 


1 Introduction 


With the construction of utility, the theory of decision in the face 
of uncertainty is, in a sense, complete. I have no further postulates 
to propose, and those I have proposed have been shown to be equiva- 
lent to the assumption that the person always decides in favor of an 
act the expected utility of which is as large as possible, supposing for 
simplicity that only a finite number of acts are open to him. At the 
level of generality that has led to this conclusion there seems to be 
little or nothing left to say. To go further now means to go into more 
detail, to investigate special types of decision problems. One type of 
decision problem of central importance is that in which the person is 
called upon to make an observation and then to choose some act in the 
light of the outcome of the observation. 

The consideration of such observational decision problems is a step 
toward those problems of great interest for statistics in which the per- 
son must decide what observation to make, that is, of course, what to 
look at, not what to see. They are the problems of designing experi- 
ments and other observational programs. 

Some remarks on observation were made in Chapter 3, but only now
that the theory of utility is established is it possible to give a relatively
complete analysis of the concept.

Observation is a concept essential to the study of statistics proper,
most of what has been said thus far being preliminary to, but not really
part of, statistics; even after this chapter and the next one, on observation,
there will still remain a major transition. One important feature
of much of what is ordinarily called statistics is, according to
my analysis, concerned with the behavior not of an isolated person, but 
of a group of persons acting, for example, in concert. In later chapters 
I will deal, so far as I am able, with the problem of group action, but 
preliminary considerations bearing on it will be made and pointed out 
from time to time in this chapter and the next.

Though the details of these two chapters may seem mathematically
forbidding, drastic simplifying assumptions are made in them to keep
extraneous difficulties to a minimum. These typically take the form
of assuming that certain sets of acts, events, and values of random varia- 
bles are finite. Even in elementary applications of the theory, these 
simplifying assumptions seldom actually hold. In some contexts, it is 
quite elementary to relax them sufficiently; in others, serious mathe- 
matical effort has been required; and some are still at the frontier of 
research. Relaxations of the assumptions will be touched on from time 
to time, sometimes explicitly but sometimes only implicitly in the choice 
of suggestive notation and nomenclature. 

Beyond this introduction, the present chapter is divided into four 
sections: § 2 analyzes informally and then formally the notion of a cost- 
free observation; §§ 3 and 4 discuss certain obvious but important con- 
ditions under which one observation, and similarly one set of acts, is 
more valuable than another; § 5 abstractly discusses problems of de- 
signing experiments or, perhaps more generally, observational programs. 


2 What an observation is 


To begin with an informal survey of observation, consider a decision 
problem, that is, a person faced with a decision among several acts. 
Calling it the basic decision problem and the acts associated with it 
the basic acts, a new decision problem would arise, if the person were 
informed before he made his decision that a particular event, say B, 
obtained. The new decision problem is related to the basic decision 
problem in a simple way; for the acts associated with it are also the 
basic acts, and the decision is to be made by computing the expected 
utility given B of the basic acts and deciding on one that maximizes 
the conditional expected utility. The basic problem may be modified 
in still another, though closely related, way. Let the person say in advance,
for each possible B_i, which of the basic acts he will decide on
when he is informed, as he is to be, which element B_i of a given partition
obtains. This will be called the derived decision problem arising
from the basic decision problem and the observation of i, and its acts
will be called derived acts. Technically speaking, the derived acts are
determined by arbitrarily assigning one basic act to each element of
the partition. For any state s, the consequence of a derived act is the
consequence for s of the basic act associated with the particular B_i in
which s lies. The terms informally introduced in this paragraph are 
defined formally later in the section. 

A derived decision problem is not necessarily different in kind from 
the basic problem; indeed it is quite possible that the basic problem can 
itself be viewed as derived from some other basic problem and obser- 
vation. 

Formidable though the description of a derived problem may seem
at first reading, its solution is, in a sense, easy and has already almost
been given; for it is clear that, if P(B_i) > 0, the person will decide to
associate with B_i a basic act the expected utility of which given B_i is
as high as possible, and, if P(B_i) = 0, it is immaterial to the person
which basic act is associated with B_i.

It is almost obvious that the value of a derived problem cannot be 
less, and typically is greater, than the value of the basic problem from 
which it is derived. After all, any basic act is among the derived acts, 
so that any expected utility that can be attained by deciding on a basic 
act can be attained by deciding on the same basic act considered as a 
derived act. In short, the person is free to ignore the observation. 
That obvious fact is the theory’s expression of the commonplace that 
knowledge is not disadvantageous. 

It sometimes happens that a real person avoids finding something 
out or that his friends feel duty bound to keep something from him, 
saying that what he doesn’t know can’t hurt him; the jealous spouse 
and the hypochondriac are familiar tragic examples. Such apparent
exceptions to the principle that forewarned is forearmed call for anal- 
ysis. At first sight, one might be inclined to say that the person who 
refuses freely proffered information is behaving irrationally and in violation
of the postulates. But perhaps it is better to admit that information
that seems free may prove expensive by doing psychological harm
to its recipient. Consider, for example, a sick person who is certain
that he has the best of medical care and is in a position to find out
whether his sickness is mortal. He may decide that his own personality
is such that, though he can continue with some cheer to live in the
fear that he may possibly die soon, what is left of his life would be
agony, if he knew that death were imminent. Under such circumstances,
far from calling him irrational, we might extol the person's rationality,
if he abstained from the information. On the other hand, such an in- 
terpretation may seem forced. (Cf. Criticism (£) of § 5.6.) 

Examples of decisions based on observation are on every hand, but 
it will be worth while to examine one in some detail before undertaking 
an abstract mathematical analysis of such decisions. Any example 
would have to be highly idealized for simplicity, because the complexity
of any real decision problem defies complete explicit description, but 
particular simplicity is in order here. 

The person in the example is considering whether to buy some of the
grapes he sees in a grocery store and, if so, in what quantity. To his
taste, the grapes may be of any of three qualities, poor, fair, and excellent.
Call the qualities Q generically and 1, 2, and 3 individually. From
what the person knows at the moment, including of course the appearance
of the grapes, he cannot be certain of their quality, but he attaches
personal probability to each of the three possibilities according to 
Table 1. 

TABLE 1. P(Q)

Q(uality)           1       2       3

P(robability)      1/4     1/2     1/4


The person can decide to buy 0, 1, 2, or 3 pounds of grapes; these 
are the basic acts of the example. Taking one consideration with an- 
other, he finds the consequences of each act, measured in utiles, in 
each of the three possible events to be those given in the body of Table 
2. The expected utilities in the right margin of Table 2 follow, of 
course, from Table 1 and the body of Table 2. 


TABLE 2. UTILITY f(Q) FOR EACH f AND EACH Q

                  Q

 f        1       2       3   |  E(f)
 0        0       0       0   |    0
 1       -1       1       3   |    1
 2       -3       0       5   |   1/2
 3       -6      -2       6   |   -1


The entries in Table 2 have not been chosen haphazardly, but with 
an attempt at verisimilitude. Thus it is supposed that if the person 
buys grapes of poor quality his dissatisfaction with the bargain will 
accelerate rapidly with the amount bought, which seems reasonable, 
especially if the keeping quality of poor grapes is low. He is, of course, 
unaffected by the quality if he buys none. Again, buying a few fair 
grapes may be mildly desirable, but overbuying is not. Finally, excel- 
lent grapes are worth buying, even in large quantities, but the utility 
of the purchase increases less than proportionally to the amount bought. 

The correct solution of the basic decision problem is to buy 1 pound 
of grapes; for that act has, according to the right margin of Table 2, 
an expected utility of 1, which is the largest that can be attained. 
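
As a check on the right margin of Table 2, the expected utilities can be recomputed directly. This is a minimal sketch; the entries of `UTILITY` are the body of Table 2 as recovered from the margins of the tables in this section, and should be treated as an assumption if the printed table reads differently. The names `P_Q`, `UTILITY`, and `expected_utility` are introduced here only for illustration.

```python
from fractions import Fraction

# Personal probabilities of the three qualities (Table 1).
P_Q = {1: Fraction(1, 4), 2: Fraction(1, 2), 3: Fraction(1, 4)}

# Utility in utiles of buying f pounds when the quality is Q (body of Table 2).
# Assumed values; they reproduce the stated expected utilities 0, 1, 1/2, -1.
UTILITY = {
    0: {1: 0, 2: 0, 3: 0},
    1: {1: -1, 2: 1, 3: 3},
    2: {1: -3, 2: 0, 3: 5},
    3: {1: -6, 2: -2, 3: 6},
}


def expected_utility(act):
    """Expected utility of a basic act under the probabilities of Table 1."""
    return sum(P_Q[q] * UTILITY[act][q] for q in P_Q)


expected = {act: expected_utility(act) for act in UTILITY}
best_act = max(expected, key=expected.get)
print(expected, best_act)
```

The act with the largest expected utility is the solution of the basic decision problem.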

Now, suppose the person is free to make an observation, that is, a
new observation in addition to those that may have contributed to the
determination of the probabilities in the basic problem. It may be, for
example, that the grocer invites him to eat a few of the grapes or that
the person is going to ask the woman beside him how they look to her.
Let there be five possible outcomes of his observation; call them x
generically and 1, 2, 3, 4, and 5 individually. I assume, though this
feature is rather incidental to the example, that low values of x tend
to be suggestive of low quality. The joint distribution of x and Q, that
is, the probability that x and Q simultaneously have any given pair of
values, is of central technical importance. Those probabilities, each
multiplied by 128 for simplicity of presentation, are given in the body
of Table 3. The right-hand and bottom margins of the table give,
also multiplied by 128, the probability of each value of x and of each
value of Q. The marginal entries are, of course, obtained by adding
rows and columns. As indicated in the lower right-hand corner of the
table, the probabilities assumed do indeed add up to 1, and the bottom
margin recapitulates Table 1.

TABLE 3. 128P(x ∩ Q)

                   Q

 x          1      2      3   |  128P(x)
 1         15      5      1   |     21
 2         10     15      2   |     27
 3          4     24      4   |     32
 4          2     15     10   |     27
 5          1      5     15   |     21

128P(Q)    32     64     32   |    128

Conditional probabilities can easily be read from Table 3. Thus, for
example, the conditional probability that x is 2, given that Q is 3, is
2/32, and the conditional probability that Q is 2, given that x is 4, is
15/27. It will be seen in later sections that the distribution of x given
Q is, in a sense, even more fundamental than the joint distribution of
x and Q.

There are 4^5 = 1,024 derived acts, since one of the four basic acts
can be assigned arbitrarily to each of the five possible outcomes of the
observation. It is an easy exercise, using Tables 2 and 3, to verify
Table 4, which shows the conditional expectation of the utility of each
basic act given each possible outcome of the observation. For each x,
the highest expected utility, given that value of x, has been italicized
(marked here with an asterisk).

TABLE 4. E(f | x)

                            x

 f         1         2         3         4         5
 0       0/21*     0/27      0/32      0/27      0/21
 1      -7/21     11/27*    32/32*    43/27     49/21
 2     -40/21    -20/27      8/32     44/27*    72/21
 3     -94/21    -78/27    -48/32     18/27     74/21*

Thus, for example, only if x is 1 will the person refrain from buying 
grapes altogether, and only if x is 5 will he risk buying 3 pounds. In 
full, the best derived act, call it g, is to buy 0, 1, 1, 2, or 3 pounds, if x 
is 1, 2, 3, 4, or 5, respectively. The value of the derived problem is the 
expected value of g, which is computed thus: 


(1)   E(g) = \sum_{x} E(g \mid x)\, P(x)
           = (0 + 11 + 32 + 44 + 74)/128
           = 161/128 \approx 1.26 \text{ utiles}.


Since the value of the basic problem is 1 utile, the envisaged observation
is worth 0.26 utile; that is, the person would if necessary pay up
to 0.26 utile for the observation.
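
The computation of the best derived act and of the worth of the observation can be sketched as follows. The utilities are the body of Table 2 as recovered from the margins (an assumption if the printed table differs), and the joint probabilities are those of Table 3; names such as `JOINT_128` and `weighted_utility` are hypothetical helpers, not notation from the text.

```python
from fractions import Fraction

QUALITIES = (1, 2, 3)

# Utility of buying f pounds when the quality is Q (body of Table 2, assumed).
UTILITY = {
    0: {1: 0, 2: 0, 3: 0},
    1: {1: -1, 2: 1, 3: 3},
    2: {1: -3, 2: 0, 3: 5},
    3: {1: -6, 2: -2, 3: 6},
}

# 128 * P(x and Q) from Table 3, indexed as JOINT_128[x][Q].
JOINT_128 = {
    1: {1: 15, 2: 5, 3: 1},
    2: {1: 10, 2: 15, 3: 2},
    3: {1: 4, 2: 24, 3: 4},
    4: {1: 2, 2: 15, 3: 10},
    5: {1: 1, 2: 5, 3: 15},
}


def weighted_utility(act, x):
    """128 * P(x) * E(act | x), that is, the numerators of Table 4."""
    return sum(JOINT_128[x][q] * UTILITY[act][q] for q in QUALITIES)


# For each outcome x, pick a basic act with the highest conditional expectation.
best_for = {}
for x in JOINT_128:
    best_for[x] = max(UTILITY, key=lambda act: weighted_utility(act, x))

# Value of the derived problem: expected utility of the best derived act.
value_derived = Fraction(
    sum(weighted_utility(best_for[x], x) for x in JOINT_128), 128
)

# Value of the basic problem, from the marginal distribution of Q.
p_q = {q: Fraction(sum(JOINT_128[x][q] for x in JOINT_128), 128) for q in QUALITIES}
value_basic = max(
    sum(p_q[q] * UTILITY[act][q] for q in QUALITIES) for act in UTILITY
)

worth = value_derived - value_basic
print(best_for, value_derived, float(worth))
```

Working with the numerators of Table 4 rather than the conditional expectations themselves avoids dividing by P(x) and multiplying by it again.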


Exercise 


1. Suppose that the person could directly observe the quality of the 
grapes. Show that his best derived act would then yield 2 utiles, and 
show that it could not possibly lead him to buy 2 pounds of the grapes. 


The notion of a decision problem based on an observation will now be 
formally described, with special reference to mathematical notation and 
other technical details. 

1. There is a set of basic acts, F with elements f, f’, etc. 

In the example of the grapes F consisted of the four envisaged acts 
of buying 0, 1, 2, or 3 pounds of grapes. 

The convention laid down at the end of § 5.4, requiring that the con- 
sequences of acts be measured in utiles, will be adhered to, and it will 
be supposed that v(F) is finite. 

2. The observation is a (not necessarily real) random variable x
associating with each state s an observed value x(s) in some set X of
possible observed values x, x', etc.

In the example of the grapes, the states s (of which the postulates 
require that there be an infinite number) were never fully described, 
and consequently the random variable x was not fully described either. 
In the same sense it may be said that the basic acts, which are also 
really random variables, were not fully described either. All that is 
really important, however, is to know the simultaneous distribution of 
the consequences of the acts in F and of the values of x. In the example 
of the grapes that information was implicit in Tables 2 and 3. 

For mathematical simplicity in the formal work to follow, it will
generally be assumed that X has only a finite number of elements, 
though the assumption can and must be relaxed in many practical 
situations. When X is assumed finite, the random variable x is, for 
all purposes of the present context, simply a partition of S, namely, 
the partition into the sets on which x is constant. Indeed, earlier in 
this section, the notion of observation was described in terms of a par- 
tition, but the description in terms of a random variable is more familiar 
in statistics and may have technical advantages, especially when the 
restriction that X be finite is relaxed. 

3. The set of strategy functions is the set of all functions associating
an element of F with each element x of X. Let the values of the generic
strategy function be denoted by f(x) and the function itself by f(x),
with x there understood as the observation rather than one of its values.

The notion of strategy function was not introduced in the informal 
description of observation, nor in the example of the grapes, because 
it is but a mathematical intermediary to the definition of derived acts 
and did not seem to call for explicit expression in the less formal con- 
texts. 

4. To each strategy function f(x) corresponds a derived act g, in the
set of all derived acts F(x), defined by

(2)   g(s) = f(s;\, x(s)) \quad \text{for all } s \in S.

It was explained that in the example of the grapes there are 4^5 derived
acts. In the same way, it can be seen in general that if X has ξ
and F has φ elements there are φ^ξ derived acts.

5. The value of F given x,

(3)   v(F \mid x) \overset{\mathrm{Df}}{=} \sup_{f \in F} E(f \mid x).

This is the function of x indicated, for the example of the grapes,
by italics in Table 4.
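
The formal definitions can be exercised on the example of the grapes by brute force. The sketch below enumerates every strategy function, confirms the count of derived acts, and checks that the best derived act attains the expectation of v(F | x). The utilities are the body of Table 2 as recovered from the margins (an assumption), the joint probabilities are those of Table 3, and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

ACTS = (0, 1, 2, 3)
OUTCOMES = (1, 2, 3, 4, 5)
QUALITIES = (1, 2, 3)

# Body of Table 2 (assumed values) and 128 * P(x and Q) from Table 3.
UTILITY = {
    0: {1: 0, 2: 0, 3: 0},
    1: {1: -1, 2: 1, 3: 3},
    2: {1: -3, 2: 0, 3: 5},
    3: {1: -6, 2: -2, 3: 6},
}
JOINT_128 = {
    1: {1: 15, 2: 5, 3: 1},
    2: {1: 10, 2: 15, 3: 2},
    3: {1: 4, 2: 24, 3: 4},
    4: {1: 2, 2: 15, 3: 10},
    5: {1: 1, 2: 5, 3: 15},
}


def prob_x(x):
    """Marginal probability P(x)."""
    return Fraction(sum(JOINT_128[x].values()), 128)


def value_of_strategy(strategy):
    """Expected utility of the derived act that uses strategy[x] on outcome x."""
    return sum(
        Fraction(JOINT_128[x][q], 128) * UTILITY[strategy[x]][q]
        for x in OUTCOMES
        for q in QUALITIES
    )


def value_given(x):
    """v(F | x): the largest conditional expected utility given outcome x."""
    return max(
        sum(Fraction(JOINT_128[x][q], 128) * UTILITY[f][q] for q in QUALITIES)
        / prob_x(x)
        for f in ACTS
    )


# One strategy function per assignment of a basic act to each outcome.
strategies = [dict(zip(OUTCOMES, choice)) for choice in product(ACTS, repeat=len(OUTCOMES))]

brute_force_value = max(value_of_strategy(s) for s in strategies)
pointwise_value = sum(value_given(x) * prob_x(x) for x in OUTCOMES)
print(len(strategies), brute_force_value, pointwise_value)
```

The agreement of the two values reflects the fact that the best derived act can be built outcome by outcome, so the full enumeration is never needed in practice.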


3 Multiple observations, and extensions of observations and of sets
of acts


If several random variables x_1, ..., x_n associating elements of S
with elements of sets X_1, ..., X_n are simultaneously under discussion,
it is natural to form the new random variable, denoted x = {x_1, ...,
x_n}, that associates with each element of S an ordered n-tuple of elements
of X_1, ..., X_n, respectively. If the context is such that x_1, ...,
x_n are thought of as observations, then x can also be thought of as an
observation and will sometimes be called a multiple observation, to
emphasize the manner of its formation. To illustrate, any item such
as profession or body temperature that might be entered on a patient's 
history can be thought of as an observation; but the whole history, or 
a filing cabinet of histories, can also be thought of as an observation, 
the history being a multiple observation of items, and the cabinet a 
multiple observation of histories. 

Consider two observations x and y. It is an interesting possibility 
that x and y are so related to each other that knowledge of the value 
of x would (almost certainly) imply (almost certain) knowledge of y. 
In that case, observation of x implies essentially the observation of y 
and generally something besides, which suggests the following three 
definitions. 

If and only if x and y are observations such that, for all s and s’ in 
some B of probability one, x(s) = x(s’) implies y(s) = y(s’); then x is an 
extension of y, and y is a contraction of x. If x is an extension of y, 
and y is an extension of x, then x and y are equivalent. 
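
For a finite set of states the definitions can be checked mechanically. The following is a minimal sketch under that simplifying assumption (the postulates themselves require infinitely many states); it takes B to be the set of states of positive probability, which is the smallest set of probability one in the finite case. The function names and the die example are hypothetical.

```python
def is_extension(x, y, prob):
    """True if observation x is an extension of observation y.

    x, y: dicts mapping each state to an observed value.
    prob: dict mapping each state to its probability.
    States of probability zero are left out of the set B.
    """
    seen = {}  # x-value -> the y-value found with it so far
    for s, p in prob.items():
        if p == 0:
            continue
        # Equal x-values must always come with equal y-values.
        if seen.setdefault(x[s], y[s]) != y[s]:
            return False
    return True


def are_equivalent(x, y, prob):
    """True if each observation is an extension of the other."""
    return is_extension(x, y, prob) and is_extension(y, x, prob)


# Illustration: the states are the six faces of a fair die.
prob = {s: 1 / 6 for s in range(1, 7)}
face = {s: s for s in prob}                # observe the face itself
parity = {s: s % 2 for s in prob}          # observe only odd or even
label = {s: "abcdef"[s - 1] for s in prob}  # the face under another name
nothing = {s: 0 for s in prob}             # a constant observation

print(is_extension(face, parity, prob), is_extension(parity, face, prob))
print(are_equivalent(face, label, prob), is_extension(parity, nothing, prob))
```

The pair `face` and `label` illustrates that equivalence concerns only what the observations reveal, not the values they happen to take.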

Strictly speaking, one should say not that x and y are equivalent, 
but rather that they are equivalent regarded as observations, for this 
would not be a good concept of equivalence to apply to random varia- 
bles regarded as such. For example, a pair of equivalent observations 
can obviously be a pair of real random variables with different expected 
values. Some properties of the relations of extension, contraction, and 
equivalence between observations are given by the following easy but 
important exercises. Throughout this set of exercises it is unnecessary 
to suppose the observations confined to a finite set of values; in the case 
of Exercise 3b, it is impossible to do so. 


Exercises 


1. x and y are equivalent, if and only if x is both an extension and a 
contraction of y. 

2a. If P{x(s) = y(s)} = 1, x and y are equivalent. 

2b. Any observation x is equivalent to itself. 

3a. If there is a value yo such that P{y(s) = yo} = 1, then every 
x is an extension of y, and any two such observations are equivalent. 
Such an observation, of course, amounts to observing nothing at all 
and will therefore be called a null observation. 

3b. If x(s) = s for almost all s ∈ S, then x extends every y.

4. If x is an extension of y, and y is an extension of z, then x is an 
extension of z. State and verify the analogous fact about equivalence. 

5a. If y’ is a function associating an element of Y with each element 
of X, and x is an observation, then the observation y such that y = 
y'(x) is a contraction of x. 

5b. If y is a contraction of x, then there is a function y’ such that
P{y(s) = y'(x(s))} = 1. What freedom is there in the choice of the
function y’? 

5c. What are the implications of Exercises 5a and 5b for equivalence 
between observations? 

6. If x and y are observations and z = {x, y} is the corresponding 
double observation, then z is an extension of x and of y. (This exercise 
seems to call for a converse saying that every extension can be regarded 
as a double observation, but no really neat one suggests itself to me. 
None the less, in thinking about extensions and contractions, the sort 
brought out by the exercise is a typical and stimulating example.) 

7. {x, y} is equivalent to x, if and only if x extends y. 

The relations of extension, contraction, and equivalence have parallels
for sets of acts, defined thus:

If F and G are (non-vacuous) sets of acts such that, for some B of
probability one, there is for each g ∈ G an f ∈ F with f(s) = g(s) for all
s ∈ B; then F is an extension of G, and G is a contraction of F. If F is
an extension of G, and G is an extension of F, then F and G are equivalent.


More exercises


8. If F is an extension of (equivalent to) G, then v(F) ≥ (=) v(G).

9. Discuss the analogues of Exercises 1, 2b, and 4 for sets of acts.

10. If F ⊃ G, then F extends G.

11. If F(x) is derived from F on observation of x, then F(x) extends F.

12. Hyp.
F(x) is derived from F on observation of x;
F(y) is derived from F on observation of y;
F(x, y) is derived from F on observation of {x, y};
F(x; y) is derived from F(x) on observation of y.

Concl.
1. F(x, y) is equivalent to F(x; y).
2. F(x, y) extends F(x) and F(y).
3. If x is equivalent to y, then F(x) is equivalent to F(y).
4. If y extends x; then F(x, y) is equivalent to F(y), F(y) is equivalent
to F(x; y), and F(y) extends F(x).


13a. Under the hypothesis of 12, the equivalences and relations of
extension among the sets of acts arising out of two observations can,
with evident conventions, be diagrammed thus:

                               → x →
      x, y ≡ x; y ≡ y; x                0.
                               → y →

13b. If y extends x, the diagram becomes

      x, y ≡ x; y ≡ y; x ≡ y → x → 0.

13c. If x and y are equivalent, the diagram becomes

      x, y ≡ x; y ≡ y; x ≡ x ≡ y → 0.

14. If F(x) and G(x) are derived from F and G, respectively, and if 
F extends G, then F(x) extends G(x). 


15. v(F(x)) = E[v(F \mid x)] = \int v(F \mid x(s))\, dP(s) \geq v(F).


4 Dominance and admissibility 


According to Exercise 3.14, if one set of acts, regarded as basic, ex- 
tends another, the first is at least as valuable as the second in the light 
of any observation whatever. This section explores a relation, domi- 
nance, which has the same property but is not so strict as extension. 
Dominance is of some importance for the theory of personal probability 
as it has been developed thus far. But its importance will be even 
greater in the study of statistics proper, where interpersonal agreement 
is of particular interest; for, as the definition shortly to be given will 
make clear, two people having different personal probabilities will agree 
as to whether one of two sets of acts dominates another, if only they 
agree which events have probability zero—a condition generally met 
in practice, and one that could if desired be dispensed with by a slight 
change in the definition of dominance. 

It will be seen that dominance and notions related to it are intimately 
associated with the sure-thing principle. Indeed, probability being 
taken for granted, the basic facts about dominance seem to give a com- 
plete expression of the sure-thing principle. Dominance and related 
concepts were much stressed by Wald, in [W3] for example. 




[...] notions, the logical connections among them, and those [...]
and extension, are to be treated. The logical connections being [...]
but simple, I think that the material lends itself better to exercises
than to expository treatment, for in such a context the reader who looks
for the motivating ideas sees them himself more easily than through
someone else’s verbalization of them. This section will therefore consist
primarily of a group of formal definitions and several exercises.


If and only if P(f(s) ≥ g(s)) = 1, f dominates g. If and only if some
(every) element of F dominates (is dominated by) g, F dominates (is
dominated by) g. If and only if F dominates every element of G,
F dominates G. If and only if f dominates g, but g does not dominate
f, f strictly dominates g. If and only if f ∈ F, and f is not strictly dominated
by any element of F, f is admissible (with respect to F).


Involving as they do acts as well as sets of acts, the definitions,
[...], introduce four different kinds of dominance. However,
this complexity can be alleviated, with a slight lapse of logic, by
identifying each act f with the set of acts of which f is the only element.
It is easily seen that this identification is in such harmony with the
definition that, once it is made, the four kinds of dominance collapse
into one.
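
On a finite set of states the definitions of dominance and admissibility translate almost word for word into code. The sketch below is an illustration under that finiteness assumption; the states are the three qualities of the grapes, the four buying acts use the utilities of Table 2 as recovered from its margins (an assumption), and the act `spoiled` is a hypothetical fifth act added only so that something is inadmissible.

```python
def dominates(f, g, prob):
    """f dominates g: f(s) >= g(s) except on states of probability zero."""
    return all(f[s] >= g[s] for s, p in prob.items() if p > 0)


def strictly_dominates(f, g, prob):
    """f dominates g, but g does not dominate f."""
    return dominates(f, g, prob) and not dominates(g, f, prob)


def set_dominates(F, G, prob):
    """F dominates G: every element of G is dominated by some element of F."""
    return all(any(dominates(f, g, prob) for f in F) for g in G)


def admissible(F, prob):
    """Elements of F not strictly dominated by any element of F."""
    return [f for f in F if not any(strictly_dominates(h, f, prob) for h in F)]


PROB = {"poor": 0.25, "fair": 0.5, "excellent": 0.25}

buy0 = {"poor": 0, "fair": 0, "excellent": 0}
buy1 = {"poor": -1, "fair": 1, "excellent": 3}
buy2 = {"poor": -3, "fair": 0, "excellent": 5}
buy3 = {"poor": -6, "fair": -2, "excellent": 6}
# Hypothetical act: never better than buying one pound, sometimes worse.
spoiled = {"poor": -2, "fair": 0, "excellent": 3}

ACTS = [buy0, buy1, buy2, buy3, spoiled]
good = admissible(ACTS, PROB)

print(len(good), strictly_dominates(buy1, spoiled, PROB))
print(set_dominates(good, ACTS, PROB))
```

Because only the states of positive probability are consulted, two people who agree on which events are null will agree on every answer these functions give, whatever their other probabilities.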


Exercises


1a. Consider analogues of Exercises 3.2b and 3.4.

1b. When can two acts dominate each other?

2a. If F extends G, then F dominates G. Discuss the converse.

2b. F(x) dominates F.

2c. If F ⊃ G, then F dominates G.

3a. If F ⊂ G, and F dominates G, then all admissible elements of G
are contained in F.

3b. If any finite number of non-admissible elements is deleted from
F, what remains of any subset of F that dominated F continues to
dominate F.

3c. Though the set of admissible elements of F may in some instances
dominate F, no proper subset of the set of admissible elements can ever
do so; but, if any other subset dominates F, some proper subset of it
also does.

3d. If F is finite, the set of admissible elements of F dominates F.

3e. Discuss the role of “finite” in 3b and 3d.

4a. If the set of admissible elements of F dominates G, and G dominates
F, then the set of admissible elements of F is equivalent to the
set of admissible elements of G. 

4b. If F and G dominate each other, and either is finite, then the 
sets of admissible elements of F and G, respectively, are equivalent to 
each other, and each dominates both F and G. 

5. If F dominates G, then v(F) ≥ v(G).

6. If F dominates G, then, for any observation x, F(x) dominates 
G(x). 


5 Outline of the design of experiments 


Often, especially in statistics, a decision problem can be seen as the
problem of deciding which of several experiments, or which of several
observational programs, if that is really a more general term, to undertake.

In this section the notion of the decision problem derived from a 
basic decision problem and an observation must be elaborated a little, 
because, as derived acts have been treated thus far, they correspond to 
the possibility of making an observation free of charge. Though obser- 
vations are sometimes free, there is typically a cost associated with 
making them; information must typically be bought either from other 
people or, more often, from nature, so to speak. The cost of information
may be money, trouble, one’s own life, that of another, or any of
innumerable possibilities, but all can in principle be measured in terms 
of utility. The cost of an observation in utility may be negative as 
well as zero or positive; witness the cook that tastes the broth. 

In principle, if a number of experiments are available to a person, he 
has but to choose one whose set of derived acts has the greatest value 
to him, due account being taken of the cost of observation. That simple 
formulation, like some others in this book, is, in a sense, oversimple; it 
abstracts from the enormous variety of considerations that enter into 
the careful design of any experiment. The possibility of so abstracting 
from variety does not remove the ultimate necessity of studying some 
aspects of that variety in detail. R. A. Fisher’s The Design of Experiments
[F4], for example, is concerned almost exclusively with experiments
based on a special technique called the analysis of variance, and it is 
but an introduction to even that important facet of statistics. Again, 
there is a growing literature (in which the work of A. Wald is outstand- 
ing) on sequential analysis, which is concerned in principle with all ex- 
periments in which later parts of the experiment are conducted in the 
light of what happens in earlier parts; but this literature has, by neces- 
sity, been confined to a relatively tiny part of that domain. 
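
The simple formulation of the preceding paragraph, choosing the experiment whose derived acts have the greatest value net of cost, can be sketched for the example of the grapes. Everything about the costs is hypothetical: the sketch assumes each experiment has a fixed cost in utiles that is simply subtracted from the value of its derived problem. The experiment names `none`, `taste`, and `assay` are likewise introduced only for illustration, and the utilities are the body of Table 2 as recovered from its margins.

```python
from fractions import Fraction

# Body of Table 2 (assumed values): UTILITY[f][Q].
UTILITY = {
    0: {1: 0, 2: 0, 3: 0},
    1: {1: -1, 2: 1, 3: 3},
    2: {1: -3, 2: 0, 3: 5},
    3: {1: -6, 2: -2, 3: 6},
}
P_Q = {1: Fraction(1, 4), 2: Fraction(1, 2), 3: Fraction(1, 4)}

# 128 * P(x and Q) for the tasting observation (Table 3).
TASTE_128 = {
    1: {1: 15, 2: 5, 3: 1},
    2: {1: 10, 2: 15, 3: 2},
    3: {1: 4, 2: 24, 3: 4},
    4: {1: 2, 2: 15, 3: 10},
    5: {1: 1, 2: 5, 3: 15},
}

# Each experiment is a joint distribution: outcome -> {Q: P(outcome and Q)}.
EXPERIMENTS = {
    "none": {"-": dict(P_Q)},  # a null observation
    "taste": {x: {q: Fraction(n, 128) for q, n in row.items()}
              for x, row in TASTE_128.items()},
    "assay": {q: {q: P_Q[q]} for q in P_Q},  # the quality observed directly
}

# Hypothetical costs of observation, in utiles.
COST = {"none": Fraction(0), "taste": Fraction(1, 8), "assay": Fraction(3, 2)}


def value(joint):
    """Value of the derived problem: for each outcome take the best basic act."""
    return sum(
        max(sum(p * UTILITY[f][q] for q, p in row.items()) for f in UTILITY)
        for row in joint.values()
    )


net = {name: value(joint) - COST[name] for name, joint in EXPERIMENTS.items()}
best_experiment = max(net, key=net.get)
print(net, best_experiment)
```

With these particular costs the tasting is chosen; a cheaper assay or a costlier tasting would change the answer, which is the whole point of weighing value against cost.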

Before turning to a more formal recapitulation of the outline of the
design of experiments, this may be a good place for a few speculative 
words about the difference, if any, between experiment and observation. 

Some sciences are commonly called experimental as opposed to others 
that are called observational. Aerodynamics, the psychology of rote 
learning, and the genetics of fruit flies would typically be called experi- 
mental sciences; and, to take parallel examples, meteorology, the psy- 
chology of dreams, and human genetics would be called observational. 
But it is widely agreed, and the most casual consideration makes it 
clear, that any basic difference that may really be present resides not 
in the sciences themselves but in the methods typical of each. To illustrate
the role of observation in sciences ordinarily considered experimental
and vice versa, observations of wild populations of fruit flies
have been useful in the study of the genetics of fruit flies; the effects of
fatigue, for example, on dream content may well be the subject of an
experiment; and, except for the atom, no topic in science is more popular
today than experimental rain making. The illustrations could be
extended indefinitely, and there is also a less direct sort exemplified by
the discipline called experimental medicine, which typically studies experiments
on animals with the hope, often justified, that the findings
thus obtained can be extrapolated to humans.


The problem, then, is to distinguish an experiment from an observation.
Except for brevity, it might be better to say mere observation, for, in
general usage, an experiment would be considered a special sort of
observation.

The first apparent contrast that comes to mind is that experimentation
is generally thought of as active and observation as passive. But,
upon examination, it is seen that observation is also active, for
observations are typically made by going somewhere to observe, or waiting
attentively till something happens. Often it is not only the observer
himself who must be transported and put in readiness to make an
observation, but also a considerable body of apparatus. What demands
more activity than the modern observation of a solar eclipse?

Another apparent contrast is that the experimenter acts on the material
he observes, whereas the observer acts only on himself and on instruments
of observation that may be regarded as extensions of his own sense
organs. If this criterion were accepted altogether naively, there would
be no such thing as a physiological experiment on one's self; and even
sophisticated interpretations might find it difficult to accommodate
psychological experiments on one's self.

Finally, experiments as opposed to observations are commonly supposed
to be characterized by reproducibility and repeatability. But
the observation of the angle between two stars is easily repeatable and
with highly reproducible results in double contrast to an experiment to 
determine the effect of exploding an atomic bomb near a battleship. 
All in all, however useful the distinction between observation and ex- 
periment may be in ordinary practice, I do not yet see that it admits 
of any solid analysis. At any rate, no formal use of the distinction will 
be attempted in this book. 

Return now to the notion of observation subject to cost. It may be 
that the value of the random variable x is observable but only at a 
cost c, a real-valued random variable measured in utiles. If, as hereto- 
fore, F(x) denotes the set of acts derived from F on cost-free observa- 
tion of x, let F(x) — c denote the set of derived acts subject to the ran- 
dom cost c. This notation is interpreted to mean that, if f is the generic 
element of F(x), then f — c (which, being a utility-valued function of 
s, is an act) is the generic act of the set F(x) — c. Very often the cost 
of an observation is independent of s, but not, for example, for him that 
tests the sharpness of a thorn with his finger. Since observations are 
typically paid for before, or simultaneously with, making the observa- 
tion, the cost is typically observed along with the observation proper. 
Put differently, the cost c is typically a contraction of the observation 
x. Thus, if in some special context any advantage were to be gained 
by so doing, it would not be drastic to assume the cost of observing x 
to be a function of the form c’(x); but, as a matter of fact, no such ad- 
vantage has come to my attention. It is not difficult to think of ex- 
periments to which the assumption does not apply. For example, in 
the present state of uncertainty about the long-term effects of x-rays,
anyone conducting a short-term experiment in which young human be- 
ings were subjected to large doses of x-radiation would risk costs that 
might not overtly manifest themselves for half a century, or even for 
generations. 

Much that would ordinarily be called observation cannot be described 
by saying that the random cost is simply to be subtracted from each de- 
rived act of the corresponding observation thought of as free of cost. 
Allowing that it may be legendary, the form of trial by ordeal in which 
the guilty floated safely to be hanged and the innocent drowned to be 
exonerated epitomizes such a situation; except in point of absurdity, 
ordinary industrial destructive testing of electric fuses and other prod- 
ucts is much the same. Strictly speaking, discrepancy occurs even in 
the ordinary context in which the cost of observation is a fixed sum of 
money; for the utility of money is not strictly linear, so the cost of ob- 
servation typically affects different derived acts somewhat differently. 
This sort of situation is indeed so common as to introduce at least a
slight error into almost every application of the notion of cost as a
subtractive term. It would therefore be desirable to extend considerably
the notion of cost of observation, but, thus far, I see no way to do so 
that does not destroy the mathematical advantage of singling problems 
of observation out of the class of decision problems generally. 

It is convenient now to analyze the appropriateness of regarding the 
number v(F) as a measure of the value of F. As must already be clear 
to the reader, if a person is to make a preliminary decision limiting his
next decision to one or another of several sets of acts, say, F, G, and H, 
then his preliminary decision will select a set that has the highest value 
of v, and the preliminary and secondary decisions, regarded as a single 
grand decision, amount to the problem of deciding on an act from 
$F \cup G \cup H$. So far as this use of $v$ is concerned, any increasing
monotonic function of $v$ such as $v^3$ or $3^v$ would be equally
satisfactory, but $v$ has an advantage in arithmetic simplicity when
costs of observation are involved. Consider, for example, the problem of
whether to make a particular observation at the random cost
$\mathbf{c}$ or to make no observation at all. The two sets of acts
involved may then be symbolized by $(F(\mathbf{x}) - \mathbf{c})$ and $F$,
respectively. The peculiar simplicity of $v$ as a measure of the value of
a set of acts, in this context, is exhibited by the almost obvious fact
that $v(F(\mathbf{x}) - \mathbf{c}) = v(F(\mathbf{x})) - E(\mathbf{c})$.
It may be remarked in passing that $v$ is a particularly good measure in
any context where $F$, $G$, or $H$ is, so to speak, made available by
lot, a possibility realized in (7.3.2).
If one of a finite number of observations is to be chosen, each with
its own cost (one of which may be the null observation), the person
will choose an observation for which $v(F(\mathbf{x})) - E(\mathbf{c})$
is as large as possible. If the number of observations among which a
decision is to be made is infinite, that function may not attain a
maximum, but the value of the situation to the person can reasonably be
taken as the supremum of the function; there are, of course,
observations among those available for which the supremum is arbitrarily
nearly attained.
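
The identity v(F(x) - c) = v(F(x)) - E(c) can be checked numerically on a small example. The following is only a sketch under stated assumptions: the four states, their probabilities, the two basic acts, the observation, and the state-dependent cost are all invented for illustration, and the function names are hypothetical rather than notation from the text.

```python
def value(acts, prob):
    """v(F): the best expected utility attainable with no observation."""
    return max(sum(prob[s] * f[s] for s in prob) for f in acts)


def value_given_observation(acts, prob, x, cost=None):
    """v(F(x) - c): choose the best basic act separately for each value of x.

    `cost` maps each state to the utility paid for observing; None means free.
    """
    if cost is None:
        cost = {s: 0.0 for s in prob}
    total = 0.0
    for observed in set(x.values()):
        states = [s for s in prob if x[s] == observed]
        total += max(
            sum(prob[s] * (f[s] - cost[s]) for s in states) for f in acts
        )
    return total


# Invented four-state problem (assumption, not taken from the text).
PROB = {"s1": 0.1, "s2": 0.2, "s3": 0.3, "s4": 0.4}
ACTS = [
    {"s1": 5.0, "s2": 5.0, "s3": 0.0, "s4": 0.0},
    {"s1": 0.0, "s2": 0.0, "s3": 4.0, "s4": 4.0},
]
X = {"s1": "low", "s2": "low", "s3": "high", "s4": "high"}  # the observation
COST = {"s1": 0.5, "s2": 0.1, "s3": 0.2, "s4": 0.3}         # cost varying with s

expected_cost = sum(PROB[s] * COST[s] for s in PROB)
v_none = value(ACTS, PROB)
v_free = value_given_observation(ACTS, PROB, X)
v_paid = value_given_observation(ACTS, PROB, X, COST)

print(v_none, v_free, v_paid, v_free - expected_cost)
```

In this invented problem the observation is worth making, since the value net of expected cost still exceeds the value of acting without observing.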


CHAPTER 7 


Partition Problems 


1 Introduction 


In the introduction of the preceding chapter it was explained that 
the treatment of decision problems in general had been carried to a 
logical conclusion, and that to study decision problems further it had 
become necessary to specialize. The notion of observation was accord- 
ingly chosen as the subject of specialization. The situation now re- 
peats itself at a new level, for I have now covered the main points that 
occur to me about observation in general, though I see considerably 
more to say about a certain type of observation. 

The type of observation problem to which the present chapter is de- 
voted, though relatively special, is still very general. Indeed, its gen- 
erality is suggested by the fact that no other type of problem is syste- 
matically treated in modern statistics. In objectivistic terms, it would 
be described as the type of decision problem in which the consequence 
of each basic act depends only on which of several (possibly infinitely 
many) probability distributions does in fact apply to the random vari- 
able to be observed. 

Modern statistics has no name for this type of problem, because it 
recognizes no other type; and no particularly suggestive name occurs 
to me. I am therefore tentatively adopting the noncommittal name
“partition problem.” Such motivation as there is for that name will 
be apparent when the concept is defined. 

In non-objectivistic terms, a partition problem has the following
structure. There are, of course, basic acts $F$ and an observation
$\mathbf{x}$. The peculiar feature is a random variable $\mathbf{b}$,
which is typically not subject to observation, with the property that
every $f$ in $F$ is constant given that $\mathbf{b}$ has any particular
value $b$.

In many practical problems b takes on an infinity, even a non-de- 
numerable infinity, of values, but systematic consideration of such 
problems would involve those advanced mathematical techniques that 
are explicitly being avoided in this book. Glossing over such questions 
of technique for the moment, the state of the world, which is itself a
random variable, might play the role of b; with respect to this b, any
observational decision problem would presumably be a partition prob- 
lem. It may, therefore, be inaccurate to call partition problems special, 
but they are special whenever b is not equivalent to the state of the 
world. 

As has just been mentioned, the general policy of this book with re- 
spect to mathematical technique restricts formal treatment of partition 
problems here to those in which b assumes only a finite number of dif- 
ferent values, that is to say, those in which b is to all intents and
purposes a partition B; whence the name “partition problem.” For the
reader who is not familiar with the elements of the geometry of n-dimen- 
sional convex bodies, there will be a distinct expository advantage in 
confining the formal treatment still further to twofold partitions. At 
the same time, by explicit statements and by the use of suggestive no- 
tation, all readers will be given at least some idea of the extension of 
the theory to n-fold partitions; indeed, a reader familiar, for example, 
with Sections 16.1-2 of [V4], or with [B20] will find the extension as 
plain as if it had been made explicitly. Thus the restriction to twofold 
as opposed to n-fold partitions will be to the advantage of some and to
the disadvantage of none.

Partition problems are even closer than are observational problems
generally to the subject matter of statistics proper. In particular, in
the course of this chapter, multipersonal considerations will from time 
to time be pointed out in connection with partition problems. 


2 Structure of (twofold) partition problems 


A central feature of a twofold partition problem is, of course, a
twofold partition, or dichotomy, $B_i$, $i = 1, 2$. By way of
abbreviation let $\beta(i) = P(B_i)$, and
$\beta = \{\beta(1), \beta(2)\}$. Thus $\beta$ is a pair of numbers such
that $\beta(i) > 0$ and $\sum_i \beta(i) = \beta(1) + \beta(2) = 1$.
Since $\beta(2) = 1 - \beta(1)$, it might seem superfluous to have a
special notation for $\beta(2)$, but this redundancy more than pays for
itself in symmetry, especially for the extension of the theory to n-fold
partitions. The possibility that one of the $\beta(i)$'s vanishes has
been ruled out, for it is neither realistic nor interesting, and its
retention would mar the exposition of the theory.

Each basic act $f \in F$ is characterized by a pair of numbers $f_i$
such that

$$P(f(s) = f_i \mid B_i) = 1 \tag{1}$$

for each $i$. The technical assumption will be made that as $f$ ranges
over $F$ the numbers $f_i$ are bounded from above for each $i$, which is
a little more stringent than the now familiar assumption that
$v(F) < \infty$.

The assumption expressed by (1) is made for definiteness and simplicity,
though its full force will seldom be used. The possibility of relaxing
(1) in certain contexts will be mentioned from time to time, especially
since this possibility is of some interest even in the exploitation
of (1) itself. In particular, for several pages now it will scarcely ever
be necessary to assume anything about the structure of $F$ relative to
$B_i$, except that $E(f \mid B_i)$ is bounded from above for each $i$;
for making the abbreviation $f_i = E(f \mid B_i)$, almost everything from
here through Exercise 1 applies verbatim.

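The expected utility of any $f \in F$ can be computed in several forms
thus:

$$
\begin{aligned}
E(f) &= E(f \mid B_1)P(B_1) + E(f \mid B_2)P(B_2) \\
&= f_1\beta(1) + f_2\beta(2) \\
&= \sum_i f_i\beta(i) \\
&= f_2 + (f_1 - f_2)\beta(1).
\end{aligned}
\tag{2}
$$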


The first of these forms expresses the expected value in general terms; 
the second utilizes abbreviations; the third is an obvious mathematical 
transcription of the second, particularly suggestive of extension to the 
n-fold situation; the fourth sacrifices the symmetry exhibited by the 
preceding three in order to take advantage of the relation between 
$\beta(1)$ and $\beta(2)$. From the fourth form of (2), it is clear that,
for fixed $f$, $E(f)$ is a linear function of $\beta(1)$. Henceforth that
fact, for example, would be expressed in symmetric form by saying that
$E(f)$ is linear in $\beta$, and the dependence of $E(f)$ on $\beta$
might be explicitly indicated by writing $E(f \mid \beta)$.

Since in any one decision problem $\beta$ is constant, it might seem
pointless to emphasize that $E(f \mid \beta)$ is linear in $\beta$. But
there are, in fact, two different reasons for being interested in
variation of $\beta$. In the first place, once the observation
$\mathbf{x}$ has been observed to have the value $x$, the basic, or a
priori, decision problem is replaced by an a posteriori problem in which
$P(B_i \mid x)$ plays the role originally played by $P(B_i) = \beta(i)$.
Second, interest in comparing different people is becoming increasingly
more explicit as the book proceeds. In particular, it is of interest to
compare people who have available the same set of basic acts and who,
at least so far as the distribution of $\mathbf{x}$ and the acts in $F$
are concerned, have the same conditional personal probability given
$B_i$ but who attach different probabilities $\beta(i)$ to the elements
of the partition.

To emphasize its dependence on $\beta$, $v(F)$ will sometimes be written
$v(F \mid \beta)$; its computation in the following fashion is
fundamental to the theory of partition problems.

$$
\begin{aligned}
v(F \mid \beta) &= \sup_{f \in F} E(f \mid \beta) \\
&= \sup_{f \in F}\, [f_1\beta(1) + f_2\beta(2)] \\
&= k(\beta),
\end{aligned}
\tag{3}
$$

where $k(\beta)$ is defined by the equation in which it occurs. According
to Exercise 4 of Appendix 2, the function $k$ is convex in $\beta$, that
is, $k$ is convex when recognized as a function of $\beta(1)$ alone.
Interpreted as a pair of a priori probabilities, $\beta$ is confined to
the open interval defined by $\sum_j \beta(j) = 1$, $\beta(i) > 0$, but it
is valuable to recognize that $k$ is defined, convex, and continuous on
the closed interval $\sum_j \beta(j) = 1$, $\beta(i) \ge 0$.
Many typical features of the relationship between $F$ and $B_i$ are
illustrated graphically by Figure 1. The abscissa of that graph
represents both $\beta(1)$ and $\beta(2)$, as indicated, and the ordinate
is measured in utiles. The straight lines, the left ends of which are
marked with letters, graph as functions of $\beta$ the expected values of
the several acts of the particular problem represented. The heights of
the lines at their two ends are the corresponding values of the $f_i$'s.

Figure 1

The graph of $k$ is marked by heavy line segments. The lines a, c, and e,
and they alone, touch the graph of $k$; for they represent the only acts
that are optimal for some value of $\beta$. The act represented by d is
inadmissible (if (1) is taken literally), being in fact strictly
dominated by every other act except e, and it is therefore superfluous
to the person, no matter what the value of $\beta$; b is obviously
equally superfluous, but for a different reason.

In many typical problems in which F has an infinity of elements, k 
is, unlike the k in Figure 1, strictly convex; that is, its only intervals 
of linearity are point intervals. 
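
The construction of k as the upper envelope of a finite family of straight lines can be sketched in a few lines of code. The five acts below are invented numbers chosen only to reproduce the qualitative pattern described for Figure 1 (three acts that are sometimes optimal, one strictly dominated act, and one undominated act that is never optimal); they are not read off the figure, and the function names are hypothetical.

```python
# Invented acts (f1, f2): utility if B_1 obtains, utility if B_2 obtains.
ACTS = {
    "a": (0.0, 10.0),
    "b": (3.5, 7.0),
    "c": (6.0, 6.0),
    "d": (-1.0, 5.0),
    "e": (10.0, 0.0),
}


def expected_utility(f, beta1):
    """E(f | beta) = f2 + (f1 - f2) * beta(1), the fourth form of (2)."""
    f1, f2 = f
    return f2 + (f1 - f2) * beta1


def k(acts, beta1):
    """k(beta): the upper envelope of the lines E(f | beta)."""
    return max(expected_utility(f, beta1) for f in acts.values())


def sometimes_optimal(acts, grid=1001, tol=1e-9):
    """Names of acts whose line touches the graph of k at some grid point."""
    touching = set()
    for j in range(grid):
        beta1 = j / (grid - 1)
        best = k(acts, beta1)
        touching |= {
            name
            for name, f in acts.items()
            if expected_utility(f, beta1) >= best - tol
        }
    return touching


def strictly_dominates(f, g):
    """True if f is better than g whichever element of the dichotomy obtains."""
    return f[0] > g[0] and f[1] > g[1]


print(sorted(sometimes_optimal(ACTS)))
print([name for name, f in ACTS.items() if strictly_dominates(f, ACTS["d"])])
```

With these numbers the act b is dominated by no single act, yet its line lies everywhere below the envelope, which is the "different reason" for which an act can be superfluous.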


Exercise 


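1. Compute and graph $k$ for the set $F$ of dichotomous acts of the
form

$$
\begin{aligned}
f_1(\delta) &= 1 - (1 + \delta)^2, \\
f_2(\delta) &= 1 - (1 - \delta)^2,
\end{aligned}
\qquad -2 \le \delta \le +2.
$$

Answer. $k(\beta) = (\beta(1) - \beta(2))^2 = [2\beta(1) - 1]^2$.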
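
The stated answer can be checked by brute force. The sketch below assumes the family of acts is f_1(δ) = 1 - (1 + δ)^2, f_2(δ) = 1 - (1 - δ)^2 with δ ranging from -2 to +2, and approximates the supremum by a maximum over a fine grid of δ, so the agreement is only to within the grid error; the function name and grid size are arbitrary choices.

```python
def k_exercise(beta1, steps=4001):
    """Approximate k(beta) by maximizing E(f | beta) over a grid of delta."""
    beta2 = 1.0 - beta1
    best = float("-inf")
    for j in range(steps):
        delta = -2.0 + 4.0 * j / (steps - 1)
        f1 = 1.0 - (1.0 + delta) ** 2
        f2 = 1.0 - (1.0 - delta) ** 2
        best = max(best, f1 * beta1 + f2 * beta2)
    return best


for beta1 in (0.0, 0.25, 0.5, 0.9, 1.0):
    print(beta1, k_exercise(beta1), (2.0 * beta1 - 1.0) ** 2)
```

The maximizing δ is β(2) - β(1), which always lies inside the permitted range, so the constraint on δ never binds.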


Turn now to the relations between an observation $\mathbf{x}$ and the
dichotomy $B_i$. As before, it will be assumed for mathematical
simplicity that the values of $\mathbf{x}$ are confined to a finite set
$X$. The probability that $\mathbf{x}$ attains the value $x$ given $B_i$,
written $P(x \mid B_i)$, is fundamental in connection with partition
problems. For one thing, as has already been indicated, there is interest
in considering people who, though differing with respect to $\beta$,
agree with respect to $P(x \mid B_i)$. The probability $P(x, B_i)$ that
$\mathbf{x}$ attains the value $x$ and that $B_i$ simultaneously obtains,
the probability $P(x)$ that $\mathbf{x}$ attains the value $x$, and the
probability $\beta(i \mid x)$ of $B_i$ given that $\mathbf{x}(s) = x$ are
derived from $P(x \mid B_i)$ and $\beta$ by means of Bayes' rule (3.5.4)
and the partition rule (3.5.3) thus:

$$P(x, B_i) = P(x \mid B_i)\,\beta(i). \tag{4}$$

$$P(x) = \sum_i P(x, B_i). \tag{5}$$

$$\beta(i \mid x) = P(x, B_i)/P(x), \tag{6}$$

if $P(x) \ne 0$; and $\beta(i \mid x)$ is meaningless otherwise. It must
be remembered that $P(x, B_i)$, $P(x)$, and $\beta(i \mid x)$ depend on
the value of $\beta$ and that a really complete notation would show that
dependence. On the other hand, the condition that $P(x) \ne 0$ is
independent of the value of $\beta$. When a second observation
$\mathbf{y}$ is to be discussed, $\beta(i \mid y)$ is, in defiance of
strict logic, to be understood as the analogue of $\beta(i \mid x)$; that
is, as the conditional probability of $B_i$ given that
$\mathbf{y}(s) = y$, not as the same function as $\beta(i \mid x)$ with
$y$ substituted for $x$. Corresponding conventions apply to $P(y)$,
$P(y \mid B_i)$, and $P(y, B_i)$. Finally, free use will be made of such
contractions as $\beta(x)$ for $\{\beta(1 \mid x), \beta(2 \mid x)\}$.
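
Equations (4) through (6) amount to a short computation. The following sketch carries it out for a dichotomy; the prior, the conditional distributions, and the function name are invented for illustration and are not part of the text.

```python
def posterior(beta, likelihood, x):
    """Return [beta(1 | x), beta(2 | x)], or None when P(x) = 0.

    beta        -- a priori probabilities [beta(1), beta(2)]
    likelihood  -- likelihood[i][x] is P(x | B_i), with i = 0, 1 in the code
    """
    joint = [likelihood[i][x] * beta[i] for i in range(len(beta))]  # (4)
    p_x = sum(joint)                                                # (5)
    if p_x == 0:
        return None          # beta(i | x) is meaningless when P(x) = 0
    return [j / p_x for j in joint]                                 # (6)


# Invented numbers (assumption): a coin-like observation with two values.
BETA = [0.3, 0.7]
LIKELIHOOD = [
    {"h": 0.8, "t": 0.2},   # P(x | B_1)
    {"h": 0.4, "t": 0.6},   # P(x | B_2)
]

print(posterior(BETA, LIKELIHOOD, "h"))
print(posterior(BETA, LIKELIHOOD, "t"))
```

Two people who share LIKELIHOOD but hold different BETA would obtain different posteriors from the same value of the observation, which is the comparison between people mentioned above.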

Equation (1) implies that

$$E(f \mid B_i, x) = E(f \mid B_i) \tag{7}$$

for all $f \in F$ and for all $x$ such that $P(x \mid B_i) > 0$.
Equation (7) is the mathematical essence of the concept of a partition
problem, and virtually all that is to be said about partition problems
applies verbatim, if (7), even without (1), applies to such observations
as may be under discussion.

In view of (7),

$$
\begin{aligned}
E(f \mid \beta, x) &= \sum_i E(f \mid B_i, x)\,P(B_i \mid x) \\
&= \sum_i f_i\,\beta(i \mid x),
\end{aligned}
\tag{8}
$$

if $P(x) > 0$.

3 The value of observation


If the observation $\mathbf{x}$ is made, and it is found that
$\mathbf{x}(s) = x$, then the a posteriori value of the set of basic
acts, written $v(F \mid x)$, or more fully $v(F \mid \beta, x)$, will
typically be different from the a priori value $v(F \mid \beta)$.
Indeed, in view of (2.8),

$$
\begin{aligned}
v(F \mid \beta, x) &= \sup_{f \in F} E(f \mid \beta, x) \\
&= v(F \mid \beta(x)) \\
&= k(\beta(x)).
\end{aligned}
\tag{1}
$$

This is the first illustration of the technical convenience of the
function $k$. It is known on general principles that
$v(F(\mathbf{x})) \ge v(F)$, but there is some interest in reverifying
the inequality in the present context; in particular, it is possible
here to say in interesting terms just when equality can obtain.

$$
\begin{aligned}
v(F(\mathbf{x}) \mid \beta) &= E(v(F \mid \mathbf{x}) \mid \beta) \\
&= E(k(\beta(\mathbf{x})) \mid \beta) \\
&\ge k(E(\beta(\mathbf{x}) \mid \beta)),
\end{aligned}
\tag{2}
$$

where the terminal inequality is an application of Theorem 1 of
Appendix 2. To appreciate the inequality (2), it is necessary to
calculate $E(\beta(i \mid \mathbf{x}) \mid \beta)$ explicitly. This
calculation, typical of many the reader must henceforth be expected to
make for himself, runs as follows, where it is to be understood that the
summation with respect to $x$ applies only to those terms for which
$P(x)$ is different from 0.

$$
\begin{aligned}
E(\beta(i \mid \mathbf{x}) \mid \beta) &= \sum_x \beta(i \mid x)\,P(x) \\
&= \sum_x \frac{P(x, B_i)}{P(x)}\,P(x) \\
&= \sum_x P(x, B_i) \\
&= P(B_i) = \beta(i).
\end{aligned}
\tag{3}
$$

Substituting (3) into (2) leads to the anticipated conclusion that

$$v(F(\mathbf{x}) \mid \beta) \ge k(\beta) = v(F \mid \beta). \tag{4}$$


According to Theorem 1 of Appendix 2, $v(F(\mathbf{x}) \mid \beta)$ is
definitely greater than $v(F \mid \beta)$ unless $\beta(\mathbf{x})$ is
confined with probability one to some interval of linearity of $k$, in
which case the observation $\mathbf{x}$ may fairly be called irrelevant
to the basic decision problem at hand. If $\mathbf{x}$ is irrelevant, the
interval of linearity to which $\beta(\mathbf{x})$ is confined must, in
view of (3), contain $\beta$. In the particularly interesting case (and
the only possible one, if $k(\beta)$ is strictly convex) in which
$\beta(\mathbf{x})$ is with probability one equal to a constant value,
that value must therefore be $\beta$. An observation for which
$\beta(\mathbf{x})$ is with probability one equal to $\beta$ may fairly
be called utterly irrelevant, because it is irrelevant no matter what set
$F$ of basic acts is associated with the dichotomy.

To say that $\mathbf{x}$ is utterly irrelevant is to say that, with
probability one,

$$
\beta(i \mid x) = \frac{P(x \mid B_i)\,\beta(i)}{P(x)} = \beta(i).
\tag{5}
$$

Since $\beta(i) > 0$, (5) is equivalent to the condition that

$$P(x \mid B_i) = P(x), \tag{6}$$

at least when $P(x) > 0$. Furthermore, it is obvious from (2.5), again
noting that $\beta(i) > 0$, that, if $P(x) = 0$, then
$P(x \mid B_i) = 0$. Therefore $\mathbf{x}$ is utterly irrelevant, if and
only if (6) holds for all $x$ and $i$; that is, if and only if the
distribution of $\mathbf{x}$ given $B_i$ is independent of $i$. This form
of the condition is intuitively evoked by the words “utterly irrelevant”
and has the advantage of not involving $\beta$.

It is noteworthy that whether an observation is utterly irrelevant
depends neither on the particular set of basic acts, nor on the value of
$\beta$, so people will agree on what is utterly irrelevant independent
of their personal a priori probabilities and the acts among which they
are free to choose.
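
The inequality (4), the identity (3), and the behavior of an utterly irrelevant observation can all be seen in a small computation. This is a minimal sketch: the three acts, the prior, and both observations are invented, and the helper names are hypothetical.

```python
ACTS = [(0.0, 10.0), (6.0, 6.0), (10.0, 0.0)]   # invented pairs (f1, f2)


def k(acts, beta1):
    """k(beta) = sup over f of f2 + (f1 - f2) * beta(1)."""
    return max(f2 + (f1 - f2) * beta1 for f1, f2 in acts)


def marginal(beta1, px_b1, px_b2, x):
    """P(x) by the partition rule."""
    return px_b1[x] * beta1 + px_b2[x] * (1.0 - beta1)


def posterior1(beta1, px_b1, px_b2, x):
    """beta(1 | x) by Bayes' rule; requires P(x) > 0."""
    return px_b1[x] * beta1 / marginal(beta1, px_b1, px_b2, x)


def value_with_observation(acts, beta1, px_b1, px_b2):
    """v(F(x) | beta) = E(k(beta(x)) | beta), summing only where P(x) > 0."""
    return sum(
        marginal(beta1, px_b1, px_b2, x) * k(acts, posterior1(beta1, px_b1, px_b2, x))
        for x in px_b1
        if marginal(beta1, px_b1, px_b2, x) > 0
    )


BETA1 = 0.5
INFORMATIVE = ({0: 0.9, 1: 0.1}, {0: 0.2, 1: 0.8})   # P(x | B_1), P(x | B_2)
IRRELEVANT = ({0: 0.3, 1: 0.7}, {0: 0.3, 1: 0.7})    # same distribution for both

v_prior = k(ACTS, BETA1)
v_informative = value_with_observation(ACTS, BETA1, *INFORMATIVE)
v_irrelevant = value_with_observation(ACTS, BETA1, *IRRELEVANT)
mean_posterior = sum(
    marginal(BETA1, *INFORMATIVE, x) * posterior1(BETA1, *INFORMATIVE, x)
    for x in (0, 1)
)

print(v_prior, v_informative, v_irrelevant, mean_posterior)
```

The last printed number illustrates (3): the posterior probability of B_1, averaged over the observation, returns the prior.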

The greatest lower bound in $\mathbf{x}$ of $v(F(\mathbf{x}) \mid \beta)$,
namely $v(F \mid \beta)$, and the circumstances under which this bound is
attained having been established, it is natural to turn to a parallel
investigation of the least upper bound. A foothold for that investigation
is found in the remark that the chord joining the ends of the graph of
$k$ never lies below the graph. Analytically,

$$k(\beta) \le \beta(1)\,k(1, 0) + \beta(2)\,k(0, 1) = l(\beta), \tag{7}$$

where $l(\beta)$ is defined by the context. Unless one of the
$\beta(i)$'s vanishes, equality holds in (7), if and only if $k(\beta)$
is a linear function. In view of (7) and (3),

$$
v(F(\mathbf{x}) \mid \beta) = E(k(\beta(\mathbf{x})) \mid \beta)
\le E(l(\beta(\mathbf{x})) \mid \beta) = l(\beta).
\tag{8}
$$

The inequality (8) gives an upper bound for $v(F(\mathbf{x}))$. In
graphical terms it says that, for any $\beta$, no observation can add
more to the value $k(\beta)$ of $F$ than the vertical distance at $\beta$
between the graph of $k$ and the graph of the chord joining the ends of
$k$.

Equality obtains in (8), if $k$ is linear, in which case the upper and
lower bounds are equal to each other irrespective of the value of
$\beta$ and the nature of the observation. If $F$ is dominated by a
single $f$, that is, if there is a single $f$ optimal given $B_i$ for
both values of $i$, then $k$ is linear. It can easily be verified that,
provided $F$ is finite and (1) actually obtains, this is indeed the only
circumstance under which $k$ is linear, and, even if these provisions are
not satisfied, the possibilities are not much more interesting.

Suppose, then, that $k$ is not linear; equality can hold in (8), if and
only if $\beta(\mathbf{x})$ is with probability one confined to the ends
of the interval, a condition that does not depend at all on $F$. By
simple considerations, which have by now been rendered familiar, this
condition on $\mathbf{x}$ is equivalent to the condition that

$$P(x \mid B_1)\,P(x \mid B_2) = 0 \tag{9}$$

for all $x$. An observation satisfying (9) may fairly be called
definitive, because, if (1) obtains, such an observation removes all
uncertainty about $f \in F$, no matter what $\beta$ may be.

Many of the observations made in everyday life are definitive. When
Old Mother Hubbard looked in the cupboard, her uncertainty was reduced
to the vanishing point. None the less, definitive observations do not
play an important part in statistical theory, precisely because
statistics is mainly concerned with uncertainty, and there is no
uncertainty once an observation definitive for the context at hand has
been made.
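
The upper bound (8) and the role of condition (9) can be illustrated in the same spirit. In this sketch the acts, the prior, and the two observations are invented; `chord` stands for the function l, and `is_definitive` is a hypothetical helper that tests (9) directly.

```python
ACTS = [(0.0, 10.0), (6.0, 6.0), (10.0, 0.0)]   # invented pairs (f1, f2)


def k(acts, beta1):
    """k(beta) = sup over f of f2 + (f1 - f2) * beta(1)."""
    return max(f2 + (f1 - f2) * beta1 for f1, f2 in acts)


def chord(acts, beta1):
    """l(beta) = beta(1) k(1, 0) + beta(2) k(0, 1)."""
    return beta1 * k(acts, 1.0) + (1.0 - beta1) * k(acts, 0.0)


def value_with_observation(acts, beta1, px_b1, px_b2):
    """v(F(x) | beta) = sum over x with P(x) > 0 of P(x) k(beta(x))."""
    total = 0.0
    for x in px_b1:
        p_x = px_b1[x] * beta1 + px_b2[x] * (1.0 - beta1)
        if p_x > 0:
            total += p_x * k(acts, px_b1[x] * beta1 / p_x)
    return total


def is_definitive(px_b1, px_b2):
    """Condition (9): P(x | B_1) P(x | B_2) = 0 for every x."""
    return all(px_b1[x] * px_b2[x] == 0 for x in px_b1)


BETA1 = 0.3
NOISY = ({"yes": 0.9, "no": 0.1}, {"yes": 0.2, "no": 0.8})
DEFINITIVE = ({"yes": 1.0, "no": 0.0}, {"yes": 0.0, "no": 1.0})

v_noisy = value_with_observation(ACTS, BETA1, *NOISY)
v_definitive = value_with_observation(ACTS, BETA1, *DEFINITIVE)

print(k(ACTS, BETA1), v_noisy, v_definitive, chord(ACTS, BETA1))
```

The noisy observation lands strictly between the lower bound k(β) and the upper bound l(β), while the definitive one attains the upper bound.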
4 Extension of observations, and sufficient statistics


It was shown in § 6.4 that a statistic, or contraction, y of an obser- 
vation x is never worth more than x and is typically worth less. The 
purpose of the present section is to explore the relation between an ob- 
servation and a contraction of itself in the case of a partition problem, 
especially to explore the special conditions in that case under which the 
statistic is as valuable as the observation itself. 

Let x and y be two observations such that y is a statistic of x, that 
is, such that, for some function y’, y(s) = y’(x(s)) with probability one. 
The values of F(x) and F(y) can be compared by the following calcula- 
tion, which in the light of the preceding section will need but little ex- 
planation. 


$$
\begin{aligned}
v(F(\mathbf{x})) &= E(k(\beta(\mathbf{x})) \mid \beta) \\
&= \sum_y E(k(\beta(\mathbf{x})) \mid \beta, y)\,P(y).
\end{aligned}
\tag{1}
$$

$$
E(k(\beta(\mathbf{x})) \mid \beta, y) \ge k(E(\beta(\mathbf{x}) \mid \beta, y)),
\tag{2}
$$

if $P(y) > 0$.

$$
\begin{aligned}
E(\beta(i \mid \mathbf{x}) \mid \beta, y) &= \sum_x \beta(i \mid x)\,P(x \mid y) \\
&= \sum_x \beta(i \mid x)\,\frac{P(x, y)}{P(y)},
\end{aligned}
\tag{3}
$$

if $P(y) > 0$.

Because of the special relationship between $\mathbf{x}$ and
$\mathbf{y}$, $P(x, y) = 0$ unless $y'(x) = y$, in which case
$P(x, y) = P(x)$. Understanding that the summation indicated by
$\sum'$ in (4) below extends only over those values of $x$ for which
$y'(x) = y$, the calculation is continued thus:

$$
\begin{aligned}
E(\beta(i \mid \mathbf{x}) \mid \beta, y)
&= {\sum_x}' \frac{P(x, B_i)}{P(x)}\,\frac{P(x)}{P(y)} \\
&= {\sum_x}' \frac{P(x, B_i)}{P(y)} \\
&= \frac{P(y, B_i)}{P(y)} \\
&= \beta(i \mid y).
\end{aligned}
\tag{4}
$$

Therefore,

$$
v(F(\mathbf{x}) \mid \beta) \ge \sum_y k(\beta(y))\,P(y) = v(F(\mathbf{y}) \mid \beta).
\tag{5}
$$

After the preceding section, it seems almost superfluous to explain
that the purpose of the calculation above is not to obtain the inequality
(5), which has already been derived with less labor and greater
generality in Exercises 6.3.8 and 6.3.13b, but to be able to discuss when
equality obtains in (5). The calculation makes it clear that equality
holds in (5), if and only if equality holds in (2) for every $y$ of
positive probability. This in turn is equivalent to the condition that,
given $y$, $\beta(\mathbf{x})$ is confined with probability one to an
interval of linearity of $k$. A sufficient condition for that is that,
given $y$, $\beta(\mathbf{x})$ be confined with probability one to a
single value, which cannot be other than $\beta(y)$; if $k$ is strictly
convex, the almost certain confinement of $\beta(\mathbf{x})$ to
$\beta(y)$ is also necessary. Now, if for every $y$ of positive
probability, $P(\beta(\mathbf{x}(s)) = \beta(y) \mid y) = 1$, then it is
true that $\beta(\mathbf{x}) = \beta(\mathbf{y})$ with unconditional
probability one, that is,

$$P(\beta(\mathbf{x}(s)) = \beta(\mathbf{y}(s))) = 1. \tag{6}$$

The condition (6) clearly does not depend on $F$, and the following
calculation so expresses it as to make clear that it does not depend on
$\beta$ either. Equation (6) is satisfied, if and only if

$$
\frac{P(x \mid B_i)\,\beta(i)}{P(x)} = \frac{P(y'(x) \mid B_i)\,\beta(i)}{P(y'(x))},
\tag{7}
$$

when $P(x) > 0$; or, if and only if

$$
\frac{P(x \mid B_i)}{P(y \mid B_i)} = \frac{P(x)}{P(y)},
\tag{8}
$$

when $P(x \mid B_i) > 0$; or, again, if and only if

$$P(x \mid B_i, y) = P(x \mid y), \tag{9}$$

when $P(y \mid B_i) > 0$; or finally if and only if $P(x \mid B_i, y)$ is
independent of $i$ for those values of $i$ for which it is defined. In
this form, and yet another to be derived in connection with (10), the
condition is widely studied in modern statistical theory and a statistic
satisfying the condition is there called a sufficient statistic. The name
is well justified; for, as has just been shown, it is sufficient, for any
purpose to which $\mathbf{x}$ might be put, to know $\mathbf{y}$, if and
only if $\mathbf{y}$ is a sufficient statistic for $\mathbf{x}$.
A different, and perhaps more congenial, approach to sufficient
statistics is the following. If the person observes the particular value
$y$ of $\mathbf{y}$, his original basic decision problem is replaced by
one with the same basic acts, but with $\beta$ replaced by $\beta(y)$.
Strictly speaking, this will fail to be a partition problem, in case
$\beta(y)$ is $(0, 1)$ or $(1, 0)$, or, for brevity, if $\beta(y)$ is
extreme. To see whether $v(F(\mathbf{x}) \mid \beta)$ is really greater
than $v(F(\mathbf{y}) \mid \beta)$, it is enough to investigate whether,
for some $y$ of positive probability for which $\beta(y)$ is not extreme,
$\mathbf{x}$ is relevant to the partition problem based on $\beta(y)$,
for if $\beta(y)$ is extreme there can be no value in following the
observation that $y$ has occurred by the observation of $\mathbf{x}$.
Therefore, $\mathbf{x}$ will be a worthless addition to $\mathbf{y}$, if,
for every $y$ for which $\beta(y)$ is not extreme, $\mathbf{x}$ is
utterly irrelevant, that is, if $\mathbf{y}$ is sufficient for
$\mathbf{x}$. If $k$ is strictly convex, the condition is also necessary.
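
The criterion that P(x | B_i, y) be independent of i wherever it is defined lends itself to a direct check on a finite example. The sketch below assumes, purely for illustration, that x consists of three independent trials with success probability 0.3 under B_1 and 0.6 under B_2; the function names are hypothetical.

```python
from itertools import product


def bernoulli_likelihood(p, n):
    """P(x | B_i) for n independent 0/1 trials with success probability p."""
    return {
        x: p ** sum(x) * (1.0 - p) ** (n - sum(x))
        for x in product((0, 1), repeat=n)
    }


def is_sufficient(likelihoods, statistic, tol=1e-12):
    """Check that P(x | B_i, y) is the same for every i for which it is defined."""
    for y in {statistic(x) for x in likelihoods[0]}:
        conditionals = []
        for lik in likelihoods:
            p_y = sum(p for x, p in lik.items() if statistic(x) == y)
            if p_y > 0:
                conditionals.append(
                    {x: p / p_y for x, p in lik.items() if statistic(x) == y}
                )
        for other in conditionals[1:]:
            if any(abs(conditionals[0][x] - other[x]) > tol for x in other):
                return False
    return True


# Invented dichotomy (assumption): two success probabilities, three trials.
LIKELIHOODS = [bernoulli_likelihood(0.3, 3), bernoulli_likelihood(0.6, 3)]

print(is_sufficient(LIKELIHOODS, sum))                 # number of successes
print(is_sufficient(LIKELIHOODS, lambda x: x[0]))      # first trial only
```

The number of successes passes the check, whereas the first trial alone does not, since the remaining trials still carry information about which B_i obtains.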

The recognition of sufficient statistics in explicit problems is often 
facilitated by the following factorability criterion. A statistic y is suffi- 
cient for x if and only if there exists at least one pair of functions R and 
S such that 


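$$P(x \mid B_i) = R(y'(x); i)\,S(x). \tag{10}$$

The necessity of the condition follows from the exhibition of a
particular $R$ and $S$ for a sufficient statistic thus:

$$
\begin{aligned}
P(x \mid B_i) &= \sum_y P(x \mid B_i, y)\,P(y \mid B_i) \\
&= \sum_y P(x \mid y)\,P(y \mid B_i) \\
&= P(y'(x) \mid B_i)\,P(x \mid y'(x)).
\end{aligned}
\tag{11}
$$

On the other hand, if $P(x \mid B_i)$ can be expressed in the form (10),
$\mathbf{y}$ can be seen to be sufficient for $\mathbf{x}$ thus: If
$P(x \mid B_i, y)$ is meaningful, it is given by

$$
P(x \mid B_i, y) = \frac{P(x, y \mid B_i)}{P(y \mid B_i)} =
\begin{cases}
0, & \text{if } y'(x) \ne y, \\[4pt]
\dfrac{P(x \mid B_i)}{P(y \mid B_i)}
= \dfrac{S(x)}{\sum_{y'(x') = y} S(x')}, & \text{if } y'(x) = y,
\end{cases}
\tag{12}
$$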
which is independent of i. The reader may be interested in asking 
himself, as an exercise, what freedom there is in choosing R and S when 
at least one such pair of factors exists. 

Interest in sufficient statistics is not confined, of course, to twofold, 
or even finite, partitions. With that in mind, the various criteria for 
sufficient statistics have been given in such terms as to be valid for any 
finite partition and the usual infinite ones. They require some
modification if the observations are not confined to a finite, or at any
rate denumerable, set of values, but formal details of that important
extension
numerable, set of values, but formal details of that important extension 
will not be given here. Elementary treatments are given in most text- 
books of mathematical statistics; more advanced and general treat- 
ments are given in [B2], [L6], and [H3]. 

There are several examples of sufficient statistics in the exercises
below, others are given in almost any fairly advanced textbook on
statistics (in particular, in [C9]), and one other general example of
extraordinary importance is treated in the next section.


Exercises


In these exercises, let $\mathbf{x}$ denote a multiple observation
$\mathbf{x} = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$, where, given $B_i$,
the $\mathbf{x}_r$'s are independent and identically distributed. There
will be no real advantage here in thinking of the partition as twofold,
or even finite, and for some of the exercises it will be impractical to
do so.

1. Let

$$
P(x_r \mid B_i) =
\begin{cases}
p_i, & \text{if } x_r = 1, \\
q_i, & \text{if } x_r = 0, \\
0, & \text{otherwise},
\end{cases}
$$

where $p_i + q_i = 1$; and let $y'(x) = \sum_r x_r$. Show that:

(a) $P(x \mid B_i) = p_i^{\,y} q_i^{\,n-y}$;

(b) $\mathbf{y}$ is sufficient for $\mathbf{x}$, using the factorability
criterion;

(c) $P(y \mid B_i) = \binom{n}{y} p_i^{\,y} q_i^{\,n-y}$, where, as
always, $\binom{n}{y} = n!/y!(n - y)!$;

(d) $P(x \mid B_i, y) = \binom{n}{y}^{-1}$.

2. For each positive integer $i$, let

$$
P(x_r \mid B_i) =
\begin{cases}
1/i, & \text{if } x_r \le i, \\
0, & \text{otherwise},
\end{cases}
$$

where the values of $\mathbf{x}_r$ are confined to the positive integers;
and let $y'(x) = \max_r x_r$. Show that:

(a) $P(x \mid B_i) = i^{-n}$, if $y \le i$,
    $= 0$, otherwise;

(b) $\mathbf{y}$ is sufficient for $\mathbf{x}$.




3. In the two exercises above it has been possible to choose the factor
$S$ identically equal to 1. To exhibit a more typical example, let $i$,
$x_r$, and $y$ be confined to the positive integers with
$y'(x) = \max_r x_r$, as in the preceding exercise, and let

$$
P(x_r \mid B_i) =
\begin{cases}
\dfrac{2x_r}{i(i + 1)}, & \text{if } x_r \le i, \\[4pt]
0, & \text{otherwise}.
\end{cases}
$$

Show that:

(a) $P(x \mid B_i) = \left(\dfrac{2}{i(i + 1)}\right)^{n} \prod_r x_r$,
if $y \le i$,
    $= 0$, otherwise.

(b) $\mathbf{y}$ is sufficient for $\mathbf{x}$.

4. Put no restriction on the conditional distributions
$P(x_r \mid B_i)$, except that $\mathbf{x}_r$ be confined with
probability one to some fixed finite set. Say, for the moment, that two
values $x$ and $x'$ of $\mathbf{x}$ are team mates, if one arises from
the other by permutation of the component observations. This divides the
possible values of $\mathbf{x}$ into teams, and, academic though it may
seem, the team to which $x$ belongs can be taken as $y'(x)$. Show that
the probability of $x$ given $y'(x)$ and $B_i$ is independent of $i$
(if it is defined at all), so that the statistic $y'(x)$ is sufficient
for $\mathbf{x}$.

If the values of the $\mathbf{x}_r$'s happen to be real numbers, then for
any $x$ it is possible to permute the component observations to obtain a
non-decreasing sequence of $n$ (not necessarily distinct) numbers, and
only one such non-decreasing sequence can be so obtained from each $x$.
The sequence thus attached through $\mathbf{x}$ to each $s$ is called in
statistical usage the sequence of order statistics corresponding to
$\mathbf{x}$. Since team mates, and only team mates, have the same order
statistics, the set of order statistics regarded as a single statistic is
equivalent to the team statistic $y'(x)$ defined more generally in the
paragraph above and is therefore sufficient.

5. Let $x_r$ given $B_i$ be subject to the normal probability density with
mean $\mu_i$, and variance $\sigma_i^2$, that is,

(13)  $\phi(x_r \mid B_i) = (2\pi\sigma_i^2)^{-1/2} \exp\{-(x_r - \mu_i)^2 / 2\sigma_i^2\}$.

This situation, though elementary, does not fall within the technical
scope of this book, because $x_r$ is not confined to a finite set of values.
The reader familiar with probability densities will see, however, that
the density of x is

(14)  $\phi(x_1, \ldots, x_n \mid B_i) = (2\pi\sigma_i^2)^{-n/2} \exp\left\{ -\dfrac{\sum x_r^2}{2\sigma_i^2} + \dfrac{\mu_i \sum x_r}{\sigma_i^2} - \dfrac{n\mu_i^2}{2\sigma_i^2} \right\}$,

which suggests that y, defined by

(15)  $y(x) = \{\sum x_r^2, \sum x_r\}$,

may fairly be called a sufficient statistic for x.

Show in the same heuristic way that, if $\sigma_i$ is independent of $i$, then
$y(x) = \sum x_r$ defines a sufficient statistic; and that, if $\mu_i$ is
independent of $i$, then $y(x) = n\sum x_r^2 - (\sum x_r)^2$ does so.

6. If w and z are observations independent of each other given $B_i$,
under what conditions can w be sufficient for {w, z}?

7. To break away from independent observations, suppose that, in
the event $B_i$, n cards are dealt from a thoroughly shuffled deck of $n + i$
cards each bearing a different serial number from 1 through $n + i$.
Let $w_r$ be the number on the rth card dealt and $w = \{w_1, \ldots, w_n\}$.
Show that $\max_r w_r$ defines a sufficient statistic for w, and that the
$w_r$'s are not independent.

8. If z extends w, and w is sufficient for y, then z is also sufficient for y.

9. If z is sufficient for w, and y is independent of both z and w, then
{z, y} is sufficient for {w, y}.

10. Every definitive statistic is sufficient.
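
The sufficiency asserted in Exercise 1 can be checked by brute force for a small case. The following is only an illustrative sketch: the helper name `conditional_given_sum`, the choice n = 4, and the two success probabilities 0.3 and 0.8 (standing in for two elements of the partition) are assumptions made for the example, not part of the text.

```python
from itertools import product
from math import comb, isclose

def conditional_given_sum(n, p):
    """Return {x: P(x | y(x))} for n Bernoulli(p) observations, y(x) = sum(x)."""
    prob_x = {x: p ** sum(x) * (1 - p) ** (n - sum(x))
              for x in product((0, 1), repeat=n)}
    prob_y = {}
    for x, px in prob_x.items():
        prob_y[sum(x)] = prob_y.get(sum(x), 0.0) + px
    return {x: px / prob_y[sum(x)] for x, px in prob_x.items()}

n = 4
cond_a = conditional_given_sum(n, 0.3)  # plays the role of one B_i
cond_b = conditional_given_sum(n, 0.8)  # plays the role of another B_i

# Sufficiency: the conditional law of x given y does not depend on i.
same_for_both = all(isclose(cond_a[x], cond_b[x]) for x in cond_a)
# Exercise 1(d): the common value is 1 / C(n, y).
matches_1d = all(isclose(cond_a[x], 1 / comb(n, sum(x))) for x in cond_a)
print(same_for_both, matches_1d)
```

The same enumeration, with `sum` replaced by `max` and the uniform distribution of Exercise 2, would serve as a check of that exercise as well.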

In virtually all statistics texts it would be said that the y defined by
(15) constitutes not one statistic, but two; similarly, the set of order
statistics would ordinarily be referred to as n statistics rather than as
one. There are contexts in which it is appropriate to try to count
statistics in that fashion, but, so far as the theory of sufficient statistics
is concerned, it often seems fruitless, if not positively detrimental, to
do so.

The concept of sufficient statistics has proved of great value in
statistical theory and practice. The reason for this does not seem to me
altogether easy to analyze, but, as the exercises above illustrate, the
families of distributions most frequently studied in statistics are
generally rich in sufficient statistics. It is hard to separate cause from
effect, here; for the distributions that are most studied tend to be those
having the greatest mathematical simplicity, and the presence of striking
sufficient statistics, such as those exhibited by Exercises 1, 2, 3, 5,
and 7, is among the sources of mathematical simplicity most often
met in the study of particular families of distributions.

It must be emphasized that sufficient statistics often provide a
significant saving in the mechanical labor of storing and presenting data.
Thus, in any experiment faithfully represented by Exercise 2, it is [...]
technical and ordinary senses of the word, to [...] of the list of $x_r$'s,
which might well be [...] of the other exercises would in principle also
lead [...] of this sort, but Exercise 5 is the only other that arises [...]

[...] statistics was introduced, together with much of the theory
associated with it, by R. A. Fisher (cf. index, [F6]). The subject has
been one of continuing interest and has been explored in several
directions; key references are [B2], [E1], [L6], [H3], [K15], and [M5].


5 Likelihood ratios 


The random variable $\beta(x)$ has played so important a role in preceding
sections that the reader will probably not be surprised to find that
$\beta(x)$ is a sufficient statistic for x, a conclusion that, in the light
of the factorability criterion (4.10), can be seen thus:

(1)  $P(x \mid B_i) = \beta_i(x) P(x) / \beta(i)$.

If a statistic is sufficient, it is sufficient irrespective of the value
of $\beta$; moreover, any multiple of it by a non-zero constant is also
sufficient. Therefore, (1) implies that for any numbers $\alpha(i)$, such
that $\alpha(i) > 0$, the multiple observation $r(\alpha)$ defined by

(2)  $r_i(x; \alpha) =_{\mathrm{Df}} \dfrac{P(x \mid B_i)}{\sum_j \alpha(j) P(x \mid B_j)}$,
     $r(x; \alpha) =_{\mathrm{Df}} \{r_1(x; \alpha), r_2(x; \alpha)\}$

is a sufficient statistic for x. Since

(3)  $\sum_i \alpha(i) r_i(x; \alpha) = 1$,

there is some redundancy in retaining both components, but this
redundancy is more than compensated by the advantage of retaining
symmetry, especially when n-fold partitions ar[...]

Formally, the $r(\alpha)$'s [...] for each $\alpha$; but to all
i[...]cient statistic, for any $r(\alpha)$ is equivalent to any other, say
$r(\alpha')$, as can be demonstrated thus:

(4)  $r_i(x; \alpha) = \dfrac{P(x \mid B_i) / \sum_k \alpha'(k) P(x \mid B_k)}{\sum_j \alpha(j) \left[ P(x \mid B_j) / \sum_k \alpha'(k) P(x \mid B_k) \right]} = \dfrac{r_i(x; \alpha')}{\sum_j \alpha(j) r_j(x; \alpha')}$.


Having such a multiplicity of forms for what is essentially one
important statistic is rather embarrassing, so there is some incentive to
pick a standard form. Setting each $\alpha(j) = 1$ recommends itself as
convenient and leads to the particular statistic $r = \{r_1, r_2\}$, where

(5)  $r_i(x) = P(x \mid B_i) / \sum_j P(x \mid B_j)$.

This form is indeed convenient for twofold and, more generally, for
n-fold partitions, but, where infinite partitions are to be dealt with, its
apparent naturalness is misleading, for the sum in the denominator of
(5) is then typically divergent. In the case of twofold partitions, a
convenient form for the statistic is that of a likelihood ratio, in the
sense introduced in § 3.6, for it is easy to see that, infinite numbers
being admitted, $P(x \mid B_1)/P(x \mid B_2)$ is equivalent to r. Henceforth,
any statistic equivalent to r will be called a likelihood ratio of x with
respect to the partition $B_i$, a definition that does not seriously conflict
with ordinary statistical usage of the term.

Figure 1 illustrates a geometric interpretation of likelihood ratios
that is sometimes valuable. The figure can best be described by telling
how to draw it. First draw a pair of cartesian coordinate axes for
variables $u_1$ and $u_2$. Next draw the two line segments represented by
$u_1 + u_2 = 1$ and $(u_1/\alpha(1)) + (u_2/\alpha(2)) = 1$ with the $u_i$'s
non-negative. The left ends of these segments are indicated in Figure 1 by
a and b, respectively, the particular value $\alpha = \{1/3, 2/3\}$ being
used for illustration. Now plot the point $\{P(x \mid B_1), P(x \mid B_2)\}$.
If $x$ has positive probability (for any, and therefore for all, $\beta$),
this point is distinct from the origin O, so it will be possible to draw
the (dashed) line connecting the origin with the point
$\{P(x \mid B_1), P(x \mid B_2)\}$. This line (or ray through the origin, as
it is often called) must necessarily pierce the line segments a and b. The
important geometrical fact, which the reader will have no difficulty in
verifying, is that these intersections occur at the points
$\{r_1(x), r_2(x)\}$ and $\{r_1(x, \alpha), r_2(x, \alpha)\}$, respectively.

[Figure 1. The segments a and b, the point $\{P(x \mid B_1), P(x \mid B_2)\}$,
and the dashed ray joining it to the origin O.]

It is also obvious that the ratio $P(x \mid B_1)/P(x \mid B_2)$ is the
reciprocal of the slope of the ray.

Since, to each $x$ that occurs with positive probability, there
corresponds a ray through the origin, the ray can be taken as a statistic;
according to the geometrical construction of the preceding paragraph,
this statistic is equivalent to r and is therefore a likelihood ratio of x
with respect to the partition $B_i$.

[...] $u_2\}$ can conveniently [...] though, of course, dif[...] in
extension of these geometric concepts to cartesian sp[...] which is
necessary in connection with n-fold partitions. In homogeneous coordinates
the likelihood ratio can conveniently be represented by any of the equally
good sets of homogeneous coordinates, $P(x \mid B_1) : P(x \mid B_2)$,
$r_1(x) : r_2(x)$, and $r_1(x, \alpha) : r_2(x, \alpha)$. Finally, it may be
remarked that $P(x \mid B_1)/P(x \mid B_2)$ is a non-homogeneous coordinate.
Thus the many equivalent forms in which the likelihood ratio statistics
can be naturally expressed correspond to the many different notations by
which a ray through the origin can be naturally designated.
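
The equivalence of the various forms $r(\alpha)$ can be illustrated numerically. The sketch below is only an illustration under assumed inputs: the function names `r_alpha` and `convert` and the two likelihood values 0.02 and 0.05 are hypothetical, chosen for the example. It checks relation (3), the conversion (4) from the standard form to $r(\alpha)$, and the fact that both forms lie on one ray through the origin.

```python
from math import isclose

def r_alpha(likelihoods, alpha):
    """r(x; alpha) of (2), given the values P(x | B_i) for one fixed x."""
    denom = sum(a * p for a, p in zip(alpha, likelihoods))
    return [p / denom for p in likelihoods]

def convert(r_other, alpha):
    """Equation (4): obtain r(x; alpha) from any other form r(x; alpha')."""
    denom = sum(a * r for a, r in zip(alpha, r_other))
    return [r / denom for r in r_other]

likelihoods = [0.02, 0.05]       # assumed values of P(x | B_1), P(x | B_2)
alpha = [1 / 3, 2 / 3]           # the value used for illustration in the text

r_std = r_alpha(likelihoods, [1.0, 1.0])   # standard form (5)
r_a = r_alpha(likelihoods, alpha)

satisfies_3 = isclose(sum(a * r for a, r in zip(alpha, r_a)), 1.0)
satisfies_4 = all(isclose(u, v) for u, v in zip(convert(r_std, alpha), r_a))
same_ray = isclose(r_std[0] / r_std[1], r_a[0] / r_a[1])
print(satisfies_3, satisfies_4, same_ray)
```

Because every form is a positive multiple of the vector of likelihoods, the common ratio of the components is exactly $P(x \mid B_1)/P(x \mid B_2)$.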

The most remarkable fact about the likelihood ratio considered as a
statistic is that it is necessary, so to speak, as well as sufficient. By that
I mean that to have the advantages of knowing x it is necessary as
well as sufficient to know the likelihood ratio. The point can be put
formally thus:

THEOREM 1 If y is sufficient for x, then y is an extension of r.

PROOF. The theorem is virtually obvious in terms of the factorability
criterion for sufficient statistics, for in the notation of (4.10)

(6)  $r_i(x) = \dfrac{R(y(x), i)}{\sum_j R(y(x), j)}$

with probability one, exhibiting $r_i$ as a function of y. ∎

COROLLARY 1 If z is sufficient for x, and if every y sufficient for x
is an extension of z, then z is equivalent to r.

By ordinary analytic standards, the likelihood ratio seems to be a
rather complicated statistic, at least in the case of n-fold partitions,
where n is at all large; for, to one who takes seriously the idea that a
multiple statistic should not also be regarded as a single statistic, the
likelihood ratio seems at first sight to be n, or perhaps (n − 1),
statistics. Yet Theorem 1 and its corollary show that the likelihood ratio
is, in a fundamental sense, the most compact sufficient statistic that a
partition problem admits.

As an explicit example of a likelihood ratio, consider the twofold
partition problem arising from Exercise 4.1 on confining attention to two
different values of p, say $p_1$ and $p_2$. The likelihood ratio r is easily
computed thus:

(7)  $P(x \mid B_i) = p_i^{y(x)} (1 - p_i)^{n - y(x)} = q_i^{n} (p_i/q_i)^{y(x)}$,

(8)  $r_i(x) = \dfrac{q_i^{n} (p_i/q_i)^{y}}{\sum_j q_j^{n} (p_j/q_j)^{y}}$.

Theorem 1 is thereby verified in the present instance; for (8) exhibits
r explicitly as a contraction of y, and y is easily exhibited as a
contraction of r thus:

(9)  $y(x) = \dfrac{\log \left[ \dfrac{r_1(x)}{r_2(x)} \left( \dfrac{q_2}{q_1} \right)^{n} \right]}{\log \dfrac{p_1 q_2}{p_2 q_1}}$.

In this example, y is, in view of (8) and (9), equivalent to the likelihood
ratio.
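
The round trip between y and r described by (8) and (9) can be sketched in a few lines. This is an illustrative sketch only: the function names and the particular values n = 10, $p_1 = 0.6$, $p_2 = 0.3$ are assumptions chosen for the example.

```python
from math import log, isclose

def likelihood_ratio(y, n, p):
    """r(x) of (8) for a twofold partition; p = (p1, p2), y = number of ones."""
    weights = [(1 - pi) ** n * (pi / (1 - pi)) ** y for pi in p]
    total = sum(weights)
    return [w / total for w in weights]

def recover_y(r, n, p):
    """Invert (8) by means of (9), recovering y from r."""
    p1, p2 = p
    q1, q2 = 1 - p1, 1 - p2
    return log((r[0] / r[1]) * (q2 / q1) ** n) / log(p1 * q2 / (p2 * q1))

n, p = 10, (0.6, 0.3)            # assumed illustrative values
recovered = [recover_y(likelihood_ratio(y, n, p), n, p) for y in range(n + 1)]
print([round(v, 6) for v in recovered])
```

Since each direction is a function of the other, neither statistic carries information the other lacks, which is the sense in which y and r are equivalent here.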


Exercises 


1. Express $k(\beta(x))$ and $v(F(x))$ in terms of the likelihood ratio thus:

(10)  $\beta_i(r) = r_i \beta(i) / \sum_j r_j \beta(j)$,
(11)  $k(\beta(x)) = k(\beta(r(x)))$,
(12)  $v(F(x) \mid \beta) = \sum_r k(\beta(r)) \sum_i P(r \mid B_i) \beta(i)$.


2. This extended exercise develops the personalistic and behavioralistic
theory of what, following the objectivistic and verbalistic traditions
of statistics, is called the testing of a simple dichotomy, a type of
decision problem that, though seldom very realistic, is a popular and
instructive example with important implications for more realistic
problems. Verbalistically such a problem is described as that of making the
best guess on the basis of an observation as to whether it is $B_1$ or $B_2$
that obtains. Behavioralistically, this is generally interpreted as the
problem of deciding, on the basis of observation, between two primary
acts one of which is preferable to the [...] if $B_2$ does. Here is one
topic in whi[...] simply a pedagogical simplification; a reader interested
in relaxing the assumption will find pages 127-130 of [W3] stimulating.

Suppose that F contains only two acts $f_1$ and $f_2$ and is dominated by
neither. Let $\phi_{ij} =_{\mathrm{Df}} E(f_i \mid B_j)$.

(a) There is no loss of generality in supposing

(13)  $\delta_1 =_{\mathrm{Df}} \dfrac{\phi_{22} - \phi_{12}}{2} > 0$,  $\delta_2 =_{\mathrm{Df}} \dfrac{\phi_{11} - \phi_{21}}{2} > 0$,

which will henceforth be done. That is, it will be supposed that $f_1$ is
appropriate only to $B_1$ and vice versa.


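(b) Show that

(14)  $k(\beta) = \sum_j \phi_{1j} \beta(j)$  for $\beta(1) \ge \delta_1/(\delta_1 + \delta_2) = \beta_0(1)$
      $= \sum_j \phi_{2j} \beta(j)$  for $\beta(2) \ge \delta_2/(\delta_1 + \delta_2) = \beta_0(2)$
      $= \tfrac{1}{2}(\phi_{11} + \phi_{21})\beta(1) + \tfrac{1}{2}(\phi_{12} + \phi_{22})\beta(2) + |\delta_1 \beta(2) - \delta_2 \beta(1)|$
      $= \sum_j e_j \beta(j) + |\delta_1 \beta(2) - \delta_2 \beta(1)|$,

where $\beta_0$ and the $e_j$'s are defined by the context.

(c) $E(f_i \mid \beta) = k(\beta)$, if and only if $\beta(i) \ge \beta_0(i)$.
This condition obtains for both $i$'s simultaneously, if and only if
$\beta = \beta_0$.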
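(d) Show that

(15)  $k(\beta(r)) = \left\{ \sum_j e_j r_j \beta(j) + |\delta_1 r_2 \beta(2) - \delta_2 r_1 \beta(1)| \right\} / \sum_j r_j \beta(j)$
      $= \sum_j \phi_{ij} \beta_j(r)$  for $r_i \ge r_i^*(\beta, \beta_0)$,

where

(16)  $r_i^*(\beta, \beta_0) = \dfrac{\beta_0(i)/\beta(i)}{\sum_j \beta_0(j)/\beta(j)}$,

and that

(17)  $v(F(x) \mid \beta) = \sum_j e_j \beta(j) + \sum_x |\delta_1 P(x \mid B_2)\beta(2) - \delta_2 P(x \mid B_1)\beta(1)|$
      $= \{e_1 + \delta_2[1 - 2P(r_1 < r_1^*(\beta, \beta_0) \mid B_1) - P(r = r^*(\beta, \beta_0) \mid B_1)]\}\beta(1)$
      $+ \{e_2 + \delta_1[1 - 2P(r_2 < r_2^*(\beta, \beta_0) \mid B_2) - P(r = r^*(\beta, \beta_0) \mid B_2)]\}\beta(2)$.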


(e) Any derived act f(x) determines a function $i$ [...] to each $x$, $i$
being implicitly defined thus: [...] determines a derived act. Show that
[...] only if $r_{i(x)}(x) \ge r^*_{i(x)}(\beta, \beta_0)$ for every $x$.
Such a function $i(x)$ is called a likelihood-ratio test associated with
$r^*$. Show that at least one likelihood-ratio test is associated with
every value of $r^*$, and that if $P(r = r^*) = 0$ (which is typically the
case) there is only [...] one.
(f) [...] by a function of $i$, the probability of [...] on the
inappropriate value of $i$ in case $B_j$ obtains is generally called
the probability of an error of the j-th kind. Analytically the
probabilities of error of the first and second kind are, respectively,

(18)  $\epsilon_1 =_{\mathrm{Df}} P(i(x) = 2 \mid B_1)$,  $\epsilon_2 =_{\mathrm{Df}} P(i(x) = 1 \mid B_2)$.

If $i^*$ is a likelihood-ratio test associated with $r^*$, show that its
errors of the first and second kind are subject to the bounds

(19)  $P(r_1 < r_1^* \mid B_1) \le \epsilon_1^* \le P(r_1 \le r_1^* \mid B_1)$,
(20)  $P(r_1 > r_1^* \mid B_2) \le \epsilon_2^* \le P(r_1 \ge r_1^* \mid B_2)$.

What about the typical case that $P(r = r^*) = 0$?


(g) Show that, if $i$ is at least as good as $i^*$ in the sense that
$\epsilon_j \le \epsilon_j^*$ for both $j$'s, then $i$ is a likelihood-ratio
test and $i$ is virtually $i^*$ in that $\epsilon_j = \epsilon_j^*$ for both
$j$'s. Hint: Consider an F and a $\beta$ for which $r^*(\beta, \beta_0) = r^*$,
showing that these exist, and note that, for this decision problem,

$E(f_{i^*} \mid \beta) = \{e_1 + \delta_2(1 - 2\epsilon_1^*)\}\beta(1) + \{e_2 + \delta_1(1 - 2\epsilon_2^*)\}\beta(2) = v(F(x) \mid \beta)$,

$E(f_i \mid \beta) = \{e_1 + \delta_2(1 - 2\epsilon_1)\}\beta(1) + \{e_2 + \delta_1(1 - 2\epsilon_2)\}\beta(2) \ge v(F(x) \mid \beta)$,

with equality if and only if $i$ is a likelihood-ratio test.
This important conclusion about likelihood-ratio tests has been much
emphasized, especially by the Neyman-Pearson school.
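
The conclusion of part (g) can be illustrated by exhaustive search in a small case. The sketch below is only an illustration under stated assumptions: the observations are taken to be n = 5 Bernoulli trials with $p_1 = 0.7$ and $p_2 = 0.4$, tests are restricted to functions of the number of ones y, and the names `binom_pmf`, `errors`, and `cutoff` are hypothetical. With these values the likelihood ratio exceeds 1 exactly when y ≥ 3, so the cutoff test is a likelihood-ratio test.

```python
from itertools import product
from math import comb

def binom_pmf(y, n, p):
    return comb(n, y) * p ** y * (1 - p) ** (n - y)

def errors(decide_1, n, p1, p2):
    """Error probabilities of a test; decide_1[y] is True when i = 1 is chosen."""
    e1 = sum(binom_pmf(y, n, p1) for y in range(n + 1) if not decide_1[y])
    e2 = sum(binom_pmf(y, n, p2) for y in range(n + 1) if decide_1[y])
    return e1, e2

n, p1, p2, cutoff = 5, 0.7, 0.4, 3   # assumed illustrative values
lr_test = tuple(y >= cutoff for y in range(n + 1))
e1_star, e2_star = errors(lr_test, n, p1, p2)

# Search every other test based on y for one that is at least as good
# in both kinds of error.
dominating = []
for test in product((False, True), repeat=n + 1):
    e1, e2 = errors(test, n, p1, p2)
    if test != lr_test and e1 <= e1_star + 1e-12 and e2 <= e2_star + 1e-12:
        dominating.append(test)
print(e1_star, e2_star, dominating)
```

No other test in this family matches the likelihood-ratio test on both kinds of error, in agreement with the exercise, since here every value of y has positive probability under both hypotheses.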

The concept of likelihood ratio, sometimes simply called likelihood,
is now one of the most pervasive concepts of statistical theory. It
seems to have been introduced in 1922 by R. A. Fisher (cf. index of
[F3]), who emphasized it in connection with the important method of
estimation named by him “the method of maximum likelihood.” Its
use in testing hypotheses was apparently first emphasized by J. Neyman
and E. S. Pearson (see Vol. II, p. 303 of [K2]). In connection with
likelihood ratios as necessary and sufficient statistics, mathematically
advanced readers will be interested in Section 6 of [L6], [B2], and
[M5]. One of the earliest contributions in this direction was made by
C. A. B. Smith [S14].


6 Repeated observations 


If $x(n) = \{x_1, \ldots, x_n\}$, where, given $B_i$, the $x_r$'s are
independent identically distributed random variables, then $v(F(x(n)))$ is
a non-decreasing function of n, for the (n + 1)-tuple is an extension of
the n-tuple. If $k(\beta)$ is strictly convex (a condition that you now
recognize as interesting), $v(F(x(n)))$ is easily seen to be strictly
increasing in n, unless the individual $x_r$'s are either utterly
irrelevant or definitive.

It is to be expected, especially in the light of the approach to certainty
discussed in § 3.6, that, as n becomes very large, x(n) will become
practically definitive. Indeed, § 3.6 makes it possible to state and prove a
formal theorem to that effect.


THEOREM 1
HYP. 1. $x(n) = \{x_1, \ldots, x_n\}$, where, given $B_i$, the $x_r$'s are
independent and identically distributed random variables.
2. The $x_r$'s are not utterly irrelevant to $B_i$.
3. $v(F \mid \beta) = k(\beta)$.
CONC. $\lim_{n \to \infty} v(F(x(n)) \mid \beta) = l(\beta) =_{\mathrm{Df}} \beta(1)k(1, 0) + \beta(2)k(0, 1)$
uniformly in $\beta$.

PROOF. Writing x as short for x(n),

(1)  $v(F(x) \mid \beta) = E\,k(\beta(x))$.

For an arbitrary $\epsilon > 0$, let the closed interval I on which k is
defined be partitioned into two subsets J and K, where J is the set of
those $\beta$'s such that

(2)  $k(\beta) \ge l(\beta) - \epsilon$

and K is the complement of J relative to I.

It follows from the continuity of the functions on each side of (2)
that $\beta \in J$, if either component of $\beta$ is sufficiently large.

The computation initiated in (1) can now be carried forward thus:

(3)  $E\,k(\beta(x)) = E(k(\beta(x)) \mid \beta(x(s)) \in J)\,P(\beta(x(s)) \in J) + E(k(\beta(x)) \mid \beta(x(s)) \in K)\,P(\beta(x(s)) \in K)$
     $\ge E(l(\beta(x)) \mid \beta(x(s)) \in J)\,P(\beta(x(s)) \in J) + \min_{\beta'} k(\beta') \cdot P(\beta(x(s)) \in K) - \epsilon$
     $= E[l(\beta(x))] - \{E(l(\beta(x)) \mid \beta(x(s)) \in K) - \min_{\beta'} k(\beta')\}\,P(\beta(x(s)) \in K) - \epsilon$
     $\ge l(\beta) - 2\max_{\beta'} |k(\beta')|\,P(\beta(x(s)) \in K) - \epsilon$.

Now, in view of the paragraph in which (3.6.15) occurs and the fact
that, if either component of $\beta$ is close to 1, $\beta \in J$,
$P(\beta(x(s)) \in K)$ becomes arbitrarily small for sufficiently large n. ∎
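
The convergence asserted by Theorem 1 can be watched numerically in a simple case. The following sketch rests on assumptions made only for illustration: Bernoulli observations with $p_1 = 0.6$ and $p_2 = 0.4$, a priori probability $\beta(1) = 0.4$, and two acts paying 1 under the hypothesis to which they are appropriate and 0 otherwise. The function names are hypothetical. For this choice $l(\beta) = 1$.

```python
from math import comb

def k(beta1, phi):
    """k(beta) for finitely many acts; phi[i] = (income under B_1, under B_2)."""
    return max(a * beta1 + b * (1 - beta1) for a, b in phi)

def value_of_n_observations(n, beta1, p, phi):
    """v(F(x(n)) | beta) = E k(beta(x)) for n Bernoulli trials, p = (p1, p2)."""
    total = 0.0
    for y in range(n + 1):
        joint1 = beta1 * comb(n, y) * p[0] ** y * (1 - p[0]) ** (n - y)
        joint2 = (1 - beta1) * comb(n, y) * p[1] ** y * (1 - p[1]) ** (n - y)
        prob_y = joint1 + joint2
        total += prob_y * k(joint1 / prob_y, phi)
    return total

phi = [(1.0, 0.0), (0.0, 1.0)]   # assumed payoffs of the two acts
beta1, p = 0.4, (0.6, 0.4)       # assumed prior and success probabilities
l_beta = beta1 * k(1.0, phi) + (1 - beta1) * k(0.0, phi)
sizes = (0, 1, 5, 25, 125, 400)
values = [value_of_n_observations(n, beta1, p, phi) for n in sizes]
print(l_beta, [round(v, 4) for v in values])
```

The values never decrease with n and approach $l(\beta)$; because this k is polygonal rather than strictly convex, the increase need not be strict (the first observation here adds nothing).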
7 Sequential probability ratio procedures


The present section digresses to discuss an interesting application of 
the ideas presented in this chapter to what is called sequential analysis. 
Sequential analysis refers in principle to the theory of observational pro- 
grams in which the selection of what observations to make in later 
phases of the program depends on what has been observed in earlier 
phases. Such behavior is commonplace in everyday life; for example, 
you look for something until you find it, but not longer. Statistics it- 
self has always used sequential procedures. For example, it is not rare 
to conduct a preliminary experiment to determine how a main experi- 
ment should be carried out. Thus, if one were required to estimate 
with a roughly preassigned precision the mean of a normal distribution 
of unknown mean and unknown variance, one might reasonably begin 
by taking ten or twenty observations, which would give some idea of 
the variance and would therefore determine about how many observa- 
tions are necessary for achieving the requisite precision. 

Commonplace though problems with sequential features are, A. Wald 
was the first to develop (1943) a systematic theory of a considerable 
body of problems of this sort. For early history see the Introduction
of [W2] and the Foreword of Section I of [S17].

Some later ideas on sequential analysis, due mainly to Wald and 
Wolfowitz, are the subject of this section. It will not be practical to 
proceed with full rigor, primarily because random variables capable of 
assuming an infinite number of values are necessarily involved. Full 
details are given in [W3] and more compactly in [A7], but not in Wald’s 
book on sequential analysis [W2]. 

Let $x = \{x(1), \ldots, x(\nu), \ldots\}$, where the $x(\nu)$'s are
conditionally an infinite sequence of independent, relevant, identically
distributed random variables. Rather informally, a sequential
observational program with respect to x is a rule telling whether to
observe x(1) or whether to make no observation at all; if the particular
value $x(1)$ is observed, whether to observe x(2) or to discontinue
observation; if the values $x(1)$ and $x(2)$ are observed whether to
observe x(3) or to discontinue observation, etc.

More formally, let N be a function of the infinite sequence of values
$x = \{x(1), \ldots, x(\nu), \ldots\}$ such that, if the sequence $x'$ agrees
with $x$ in every component from the first through the $N(x)$th, then
$N(x') = N(x)$. Such a function N determines a sequential observational
program, which is a contraction of x, call it y(x; N), defined thus:

(1)  $y(x; N) =_{\mathrm{Df}} \{x(1), \ldots, x(N(x))\}$.

It is to be understood that, if $N(x)$ is zero for some $x$, it is
identically zero, and that y(x; 0) is a null observation.

It will be assumed that the random cost associated with a sequential
observational program is proportional to the number of random variables
observed, that is, $c = N(x)\gamma$, $\gamma > 0$. No categorical defense of
this assumption is suggested, but clearly there are interesting problems
in which it is met at least approximately. The domain of applicability
of the theory can actually be considerably extended by modifying the
assumption to include a fixed overhead cost that applies except in case
N is identically zero; this does not greatly complicate the analysis, as
the interested reader will be able to see for himself. The theory would
even remain virtually unchanged, if c were only assumed to be of the
form

(2)  $c = h + \sum_{\nu=1}^{N(x)} c(\nu)$, if $N > 0$,
     $= 0$, if $N = 0$,

where $h, c(1), c(2), \ldots$ are independent with finite expected values
$E(h) > 0$, $E(c(\nu)) > 0$, and the $c(\nu)$'s are identically distributed.

For any F there are some values of $\beta$ for which it would be unwise to
adopt any sequential observational program other than the null
observation. Suppose, for example, that $\beta$ is so close to an extreme
value that $l(\beta) - k(\beta) < \gamma$; under this circumstance the most
that could be gained by observing even x itself would be less than
$\gamma$, but the cost of making so much as one observation is at least
$\gamma$. Let the set of values of $\beta$ for which it is not justified to
make any but the null observation be denoted for a while by
$J(F; \gamma)$, or simply J, for short.

Now, if $\beta \in J$, the person's utility can, by the definition of J, be
maximized by refraining from any observation but the null observation and
accepting the utility $k(\beta)$; otherwise there will be some advantage to
him in observing x(1). If the person does observe the particular value
$x(1)$ of x(1), he finds himself with a posteriori probabilities
$\beta(x(1))$ in place of the a priori $\beta$, he has paid (or at any rate
entailed) a cost $\gamma$, and he must now decide whether to make any
further observations. His new problem is simply the problem he would have
faced at the outset had his a priori probabilities been $\beta(x(1))$
instead of $\beta$, except that all utilities are now reduced by $\gamma$.
He justifiably accepts the utility $k(\beta(x(1))) - \gamma$, if
$\beta(x(1)) \in J$; otherwise he will observe x(2). Continuing this line of
argument step after step, it follows that optimal action consists in
observing successive $x(\nu)$'s until an a posteriori probability in J
occurs, and then adopting a basic act consistent with the a posteriori
probability.

In actual practice, it is far from easy to determine whether a particular
value of $\beta$ belongs to $J(F; \gamma)$, because in principle the whole
enormous variety of sequential observational programs has to be explored
to determine whether any one of them has a derived value greater than
$k(\beta)$. The practical advantage achieved in the preceding paragraph is
that of greatly restricting the class of programs that merit consideration.
Thus the problem of determining whether $\beta \in J(F; \gamma)$ does not
require a survey of all observational programs, but only of those defined
in terms of some set $J'$ according to the rule that $N(x)$ is the first
integer n for which $\beta(x(1), \ldots, x(n)) \in J'$.
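
The rule that $N(x)$ is the first n at which the a posteriori probability enters $J'$ can be sketched as follows. This is an illustration only, under assumptions not made in the text: the observations are Bernoulli with probabilities p = ($p_1$, $p_2$) of the value 1, the partition is twofold, and $J'$ is taken to be the union of two end intervals [0, lower] and [upper, 1] of values of $\beta(1)$. The names `posterior_after` and `stopping_time` are hypothetical.

```python
def posterior_after(beta1, p, observations):
    """A posteriori probability of B_1 after Bernoulli observations."""
    w1, w2 = beta1, 1 - beta1
    for x in observations:
        w1 *= p[0] if x == 1 else 1 - p[0]
        w2 *= p[1] if x == 1 else 1 - p[1]
    return w1 / (w1 + w2)

def stopping_time(beta1, p, stream, lower, upper):
    """N(x): first n whose posterior lies in J' = [0, lower] U [upper, 1]."""
    post = beta1
    if post <= lower or post >= upper:
        return 0, post               # the null observation
    seen = []
    for x in stream:
        seen.append(x)
        post = posterior_after(beta1, p, seen)
        if post <= lower or post >= upper:
            return len(seen), post
    return None, post                # stream ended before the posterior entered J'

p = (0.7, 0.3)                       # assumed illustrative values
n_stop, post = stopping_time(0.5, p, [1, 1, 0, 1, 1, 1, 1], 0.05, 0.95)
n_other, _ = stopping_time(0.5, p, [1, 1, 0, 1, 1, 1, 0, 0], 0.05, 0.95)
print(n_stop, round(post, 4), n_other)
```

The two streams agree through the sixth component, at which the first stops, so both stop there; this is the defining property required of N in the formal definition above.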

If programs corresponding to all sets $J'$ had to be examined, the
process would still be mathematically impractical; indeed, in all but
special cases, practical solutions have yet to be found. But, if any
special conditions that J must necessarily satisfy are discovered, only
sets $J'$ satisfying those conditions need be examined. Some very general
conditions are these: J contains the extreme points of I; J is
topologically closed, that is, if a value $\beta_0$ is not in J, then the
near neighbors of $\beta_0$ are also not in J. The first of these conditions
requires no comment, and the second follows easily from the continuity as
a function of $\beta$ of

(3)  $E[k(\beta(y(x; N))) - \gamma N \mid \beta] - k(\beta)$.

These conditions alone do not go far toward narrowing to practical
limits the variety of sets to be explored. Thus far in the development
of the subject, really powerful conditions have been obtained only at
the expense of considerable restrictions on the structure of F or,
equivalently, of k.


Suppose, then, that F is dominated by a finite number of acts or, [...]
the graph of k is polygonal, as it is [...] Technically, this restriction
on k may [...] interval I is the union of a finite number of intervals of
linearity of k. Under the restriction, relatively much [...] of
$J(F; \gamma)$, for it is true in general, [...]ph, that the intersection
of J with [...]sed interval.

[...]

(4)  $\sum_i E(h \mid B_i)\beta_0(i) > k(\beta_0)$,

for h is supposed to be advantageous at $\beta_0$; and

(5)  $\sum_i E(h \mid B_i)\beta_m(i) \le k(\beta_m)$,  $m = 1, 2$,

for no derived act is supposed to be advantageous at $\beta_m$, since
$\beta_m \in J$. Since $\beta_0$ is a weighted average, say
$\sum_m \gamma_m \beta_m$, of the $\beta_m$'s, and since $k(\beta)$ is linear
in the interval between $\beta_1$ and $\beta_2$, it follows from (4) and (5)
that

(6)  $\sum_i E(h \mid B_i)\beta_0(i) \le k(\beta_0)$,

contradicting (4). The supposition that $\beta_0 \in {\sim}J$ has thus been
reduced to absurdity.
The demonstration just given extends directly to n-fold problems.
The general conclusion is that the intersection of J with any domain
of linearity of k is convex, so that, if k is polyhedral, J is the union of a
finite number of closed convex sets, each lying wholly in a domain of
linearity of k. The practical implications of the conclusion are
enormously greater for twofold than for higher-fold problems, because
twofold problems lead to one-dimensional bounded, closed, convex
sets, which present no great variety, all of them being closed bounded
intervals. But threefold problems, for example, lead to closed bounded
two-dimensional convex sets, a restriction that leaves great room for
variety.
If k is polygonal, the variety of sets $J'$ to be surveyed is enormously
reduced, for $J'$ must be the union of a known number of intervals, each
of which is confined to a known interval. Suppose that this number is
m; the class of sequential observational programs to be surveyed can
be characterized by the two end points of each of the m intervals,
except that the possibility that some of the intervals are vacuous must be
borne in mind. Since the extremes of I are necessarily in J, and
therefore necessarily appear as end points of intervals in J, the
exploration has been reduced to a 2(m − 1) parameter family of
possibilities.

The possibility that m = 1, which almost means that F is dominated
by a single element of itself, is trivial; for then all $\beta$'s are in J,
and observation is never called for. This can be seen in many ways. In
particular, it follows as an illustration of the machinery that has just
been developed, thus: The end points, or extremes, of I are [...], as
always, and, since m = 1, they are both in the same [...] of J; therefore
the interval between them, namely every value of $\beta$, [...]

The possibility that m = 2 (in ordinary statistical usage, the
sequential testing of a simple dichotomy) is of particular importance.
It occurs typically when F is dominated by two acts, neither of which
dominates the other, as in Exercise 5.2. One of the two acts is
appropriate to one “hypothesis” $B_1$, and the other is appropriate to
$B_2$. In case m = 2, it is easily seen, by methods that have now been
indicated more than once, that each of the two closed intervals that
constitute J has as one end point one of the extremes of I. Neither of the
two intervals can be vacuous, nor can either consist only of a single
point. It is relatively easy to find, at least approximately, the two
values of $\beta$ that determine $J(F; \gamma)$, and the theory of this
situation has correspondingly been brought to a relatively high degree of
perfection; for details, see [S17], [W2], [W3], and [A7].

Following (or at least paraphrasing) Wald [W2], a sequential
observational program characterized by making successive observations
until the a posteriori probabilities fall into some set J, followed by
adopting a basic act appropriate to the a posteriori probability, is called
a sequential probability ratio procedure. The reason for this nomenclature
is that to observe until the a posteriori probabilities fall into J is
to observe until the numbers

(7)  $\beta(i \mid x(1), \ldots, x(N)) = \dfrac{\beta(i) P(x(1), \ldots, x(N) \mid B_i)}{\sum_j \beta(j) P(x(1), \ldots, x(N) \mid B_j)}$

lie in a certain set, or, what amounts to the same thing, satisfy certain
conditions. But, the particular value of $\beta$ having been assigned, this
is tantamount to requiring the ratios of probabilities

(8)  $\dfrac{P(x(1), \ldots, x(N) \mid B_1)}{P(x(1), \ldots, x(N) \mid B_2)}$

to satisfy certain conditions.

Since (7) and (8) are ways of expressing the likelihood ratio, the
observational program together with the act derived from it might also
be referred to as a sequential likelihood-ratio procedure. Indeed, but
for the precedent established by Wald, that would seem the better
name.

As an actual example of a sequential probability ratio procedure,
suppose that the distribution of $x(\nu)$ given $B_i$ attaches the
probabilities $p_i$ and $q_i = 1 - p_i$ to the values 1 and 0,
respectively. The expression (8) can in any case be written in the
factored form

(9)  $\prod_{\nu=1}^{N} \dfrac{P(x(\nu) \mid B_1)}{P(x(\nu) \mid B_2)}$,

and in the present example this takes the special form

(10)  $\dfrac{p_1^{y(N)} q_1^{N - y(N)}}{p_2^{y(N)} q_2^{N - y(N)}} = \left( \dfrac{q_1}{q_2} \right)^{N} \left( \dfrac{p_1 q_2}{p_2 q_1} \right)^{y(N)}$,

where

(11)  $y(N) = \sum_{\nu=1}^{N} x(\nu)$.


It is noteworthy, in connection with sufficient statistics, that the
condition that the a posteriori probability be in J is in this case
expressible, according to (10), as a condition on $y(N)$ and N. Specializing
the example further, suppose that J is of the sort appropriate to testing a
simple dichotomy. The condition that the a posteriori probability be
in ${\sim}J$ is then expressed by each of the following equivalent pairs of
inequalities, where $\alpha(1)$ and $\alpha(2)$ are positive numbers such
that $\alpha(1) + \alpha(2) < 1$.

(12)  $\beta(1 \mid x(1), \ldots, x(N)) < 1 - \alpha(1)$,
      $\beta(2 \mid x(1), \ldots, x(N)) < 1 - \alpha(2)$.

(13)  $\dfrac{\beta(1) Q}{\beta(1) Q + \beta(2)} < 1 - \alpha(1)$,
      $\dfrac{\beta(2)}{\beta(1) Q + \beta(2)} < 1 - \alpha(2)$,

where Q for the moment denotes the likelihood ratio (10).

(14)  $Q < \dfrac{\beta(2)(1 - \alpha(1))}{\beta(1)\alpha(1)} =_{\mathrm{Df}} Q^*$,
      $Q > \dfrac{\beta(2)\alpha(2)}{\beta(1)(1 - \alpha(2))} =_{\mathrm{Df}} Q_*$,

where $Q^*$, $Q_*$ are defined by the context. Since, according to (13), the
structure of ${\sim}J$ is superficially determined by three parameters, say
by $\beta(1)$, $\alpha(1)$, and $\alpha(2)$, it is worthy of some note that
the corresponding condition is ultimately expressed in terms of only two
special parameters, $Q^*$ and $Q_*$; this is only natural, considering that
${\sim}J$ is an open interval determined by its two end points. The act that
would be appropriate to $B_1$ is called for by values of $Q \ge Q^*$, and
the one appropriate to $B_2$ is called for by values of $Q \le Q_*$.
Thus far, the particular form (10) of the likelihood ratio has not
really been exploited in the calculation, so (14) applies to the testing of
simple dichotomies generally. Taking account of (10), (14) can by
elementary manipulation be put in the following form.

(15)  $y(N) < \{\log Q^* + N \log (q_2/q_1)\} / \log (p_1 q_2 / p_2 q_1)$,
      $y(N) > \{\log Q_* + N \log (q_2/q_1)\} / \log (p_1 q_2 / p_2 q_1)$,

where, for definiteness, it is supposed that $p_1 > p_2$. Thus, the region
in the (N, y) plane determined by ${\sim}J$, the region in which further
observations are called for, is a band bounded by two parallel lines of
positive slope.
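
The equivalence of the posterior form (12) of the continuation condition and the band form (15) can be checked on a grid of values of N and y. The sketch below is illustrative only: the function names and the numerical values ($\beta(1) = 0.5$, $p_1 = 0.6$, $p_2 = 0.4$, $\alpha(1) = 0.05$, $\alpha(2) = 0.1$) are assumptions chosen so that no point of the grid falls exactly on a boundary.

```python
from math import log

def q_bounds(beta1, alpha1, alpha2):
    """Q* and Q_* as defined in (14)."""
    beta2 = 1 - beta1
    q_upper = beta2 * (1 - alpha1) / (beta1 * alpha1)
    q_lower = beta2 * alpha2 / (beta1 * (1 - alpha2))
    return q_upper, q_lower

def continue_by_posterior(y, n, beta1, p, alpha1, alpha2):
    """Condition (12): both a posteriori probabilities stay below 1 - alpha(i)."""
    p1, p2 = p
    w1 = beta1 * p1 ** y * (1 - p1) ** (n - y)
    w2 = (1 - beta1) * p2 ** y * (1 - p2) ** (n - y)
    post1 = w1 / (w1 + w2)
    return post1 < 1 - alpha1 and (1 - post1) < 1 - alpha2

def continue_by_band(y, n, beta1, p, alpha1, alpha2):
    """Condition (15): y lies strictly between two parallel lines (p1 > p2)."""
    p1, p2 = p
    q1, q2 = 1 - p1, 1 - p2
    q_upper, q_lower = q_bounds(beta1, alpha1, alpha2)
    scale = log(p1 * q2 / (p2 * q1))
    slope = log(q2 / q1) / scale
    return log(q_lower) / scale + n * slope < y < log(q_upper) / scale + n * slope

beta1, p, alpha1, alpha2 = 0.5, (0.6, 0.4), 0.05, 0.1   # assumed values
agree = all(
    continue_by_posterior(y, n, beta1, p, alpha1, alpha2)
    == continue_by_band(y, n, beta1, p, alpha1, alpha2)
    for n in range(31) for y in range(n + 1)
)
band_slope = log((1 - p[1]) / (1 - p[0])) / log(p[0] * (1 - p[1]) / (p[1] * (1 - p[0])))
print(agree, band_slope)
```

With these values the common slope of the two boundary lines is 1/2, and observation continues only while y stays within the band.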


8 Standard form, and absolute comparison between observations 


If x and y are such that, for every F and $\beta$,
$v(F(x) \mid \beta) \ge v(F(y) \mid \beta)$, then x imitates, so to speak, an
extension of y, and it may appropriately be said that x is a virtual
extension of y. Correspondingly, if x is a virtual extension of y, and y is
a virtual extension of x, it may be said that x and y are virtually
equivalent.

No matter what a priori 
basic acts are available to | 
pair of virtually equivalent 
vations are indeed equivalen: 
binations of observations a 


nt obser- 
Where com- 
er, the rela- 
alence. For 


example, if x and y are equi alent to the mul- 


tiple observation {x, y}, but if x and y are only vir 

they may well be indepei 

equivalent to {x, y}. 
This section explores the noti 

equivalence. In particular, 

the class of observations vi 


ions of virtual extension and virtual 
an interesting standard representative of 
rtually equivalent to a given observation x 


easily be able to concentrate 
more understandable. 
Most of the ideas to be presented in this section were originated by
H. F. Bohnenblust, L. S. Shapley, and S. Sherman in a private
memorandum dated August 1949, which I was privileged to see at that time.
This work was extended and brought to the attention of the public by
David Blackwell in [B16].

It is obvious that, if y is a sufficient statistic for x, then x and y are
virtually equivalent. In particular the likelihood ratio r derived from
x is virtually equivalent to x. Moreover, the reader may anticipate, and
it will be formally shown in the course of this section, that if and only
if observations are virtually equivalent do their likelihood ratios have
the same distribution for every value of $\beta$, or, what comes to the same
thing, given each $B_i$, $i = 1, \dots, n$. Thus the conditional distributions
of the likelihood ratio given each $B_i$ could be taken to characterize
the observations virtually equivalent to a given one, say x. Actually,
as will be shown, the class of observations virtually equivalent to x can
be represented by the distribution of the likelihood ratio for any single
non-extreme value of $\beta$. For definiteness, the particular value
$\beta^* = \{1/n, \dots, 1/n\}$ will be used, but the interested reader will find
it a simple exercise to extend all the considerations based on $\beta^*$ to any
other non-extreme $\beta$, as would be necessary in any extension of the theory
to infinite partitions.

Let m(r) be the probability that the likelihood ratio in the standard
form (5.5) attains the particular value r when $\beta = \beta^*$. With
self-evident abbreviations,

(1)  $m(r) = P(r \mid \beta^*)$
     $= \sum_i P(r \mid B_i)(1/n)$
     $= \frac{1}{n} \sum_j \sum_{r(x)=r} P(x \mid B_j)$.

The second line of (1) exhibits m(r) expressed in terms of the n
distributions $P(r \mid B_i)$. It is rather more interesting to see that these n
distributions can themselves all be expressed in terms of the one
distribution m, as follows from the definition (5.5) of r and the third line
of (1) thus:

(2)  $P(r \mid B_i) = \sum_{r(x)=r} P(x \mid B_i)$
     $= \sum_{r(x)=r} r_i(x) \sum_j P(x \mid B_j)$
     $= n r_i m(r)$.

Similarly,

(3)  $P(r \mid \beta) = n \{\sum_i \beta(i) r_i\} m(r)$.
Regarded as a probability measure on the set of all n-tuples of
numbers r, m has the following three important properties.

(4)  $P(r_i \ge 0 \mid m) = 1$;
     $P(\sum_i r_i = 1 \mid m) = 1$;
     $E(r_i \mid m) = n^{-1}$.

Of these, the first two are obvious from the definition of r, and the third
follows by calculation from (2) thus:

(5)  $1 = \sum_r P(r \mid B_i) = n \sum_r r_i m(r)$
     $= n E(r_i \mid m)$.


Conversely, suppose that m is any mathematical probability defined
on the set of n-tuples r of numbers, subject to the conditions (4), then,
as can easily be verified, n mathematical probabilities are formally
defined by the equation $P(r \mid B_i) = n r_i m(r)$. Mathematically, r
distributed thus can be regarded as an observation. The following
calculation demonstrates the expected conclusion that the likelihood ratio
of this observation is the observation itself and that its distribution
given $\beta^*$ is m.

(6)  $\dfrac{P(r \mid B_i)}{\sum_j P(r \mid B_j)} = \dfrac{n r_i m(r)}{n \sum_j r_j m(r)} = r_i$,
     $P(r \mid \beta^*) = \sum_j n r_j m(r)(1/n) = m(r)$.
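The passage from the conditional distributions $P(x \mid B_i)$ to the standard form m, and back again by way of (2), can be checked on a small example. The sketch below is illustrative only: it assumes, as the derivation above suggests, that the standard form (5.5) is $r_i(x) = P(x \mid B_i)/\sum_j P(x \mid B_j)$; the function names and the numbers are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def standard_form(cond):
    """cond[i][x] = P(x | B_i). Return m as a dict {r: m(r)}, following (1)."""
    n = len(cond)
    m = defaultdict(Fraction)
    for x in cond[0]:
        total = sum(cond[i][x] for i in range(n))
        if total == 0:
            continue  # x has probability zero under every B_i
        # Assumed standard form (5.5): r_i(x) = P(x | B_i) / sum_j P(x | B_j).
        r = tuple(cond[i][x] / total for i in range(n))
        m[r] += total / n
    return dict(m)


def conditional_from_m(m, i):
    """Recover P(r | B_i) = n * r_i * m(r), as in (2)."""
    n = len(next(iter(m)))
    return {r: n * r[i] * prob for r, prob in m.items()}


F = Fraction
# Hypothetical observation with three values under a twofold partition.
cond = [{0: F(1, 2), 1: F(1, 4), 2: F(1, 4)},
        {0: F(1, 4), 1: F(1, 8), 2: F(5, 8)}]
m = standard_form(cond)
for r, prob in m.items():
    print(r, prob)
print(conditional_from_m(m, 0))
```

Exact rational arithmetic is used so that values of x with the same likelihood ratio are merged reliably; with floating-point numbers the grouping by r would need a tolerance.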


It is interesting and fruitful to compute $v(F(x) \mid \beta)$ in terms of m.

(7)  $v(F(x) \mid \beta) = E(k(\beta(x)) \mid \beta)$
     $= E[k(\{r_i \beta(i) / \sum_j r_j \beta(j)\}) \mid \beta]$
     $= n E[k(\{r_i \beta(i) / \sum_j r_j \beta(j)\}) \sum_j r_j \beta(j) \mid m]$.

Temporarily adopt the convention that, if $\alpha$ is any n-tuple of positive
numbers and h any function of r (not necessarily convex), $T(\alpha)h$ is a
function of r defined thus:

(8)  $T(\alpha)h(r) =_{Df} h(\{r_i \alpha(i) / \sum_j r_j \alpha(j)\}) \sum_j r_j \alpha(j)$.

Then (7) takes the abbreviated form

(9)  $v(F(x) \mid \beta) = n E(T(\beta)k(r) \mid m)$.

To see the implications of (9), it is necessary to know something about
what the operation $T(\beta)$ does to the function k, in particular to know
that $T(\beta)k$ is convex in r. The derivation of these necessary facts is
straightforward and is left to the reader as a sequence of exercises.


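Exercises

1a. $T(\alpha)T(\beta)h = T(\{\alpha(1)\beta(1), \dots, \alpha(n)\beta(n)\})h = T(\beta)T(\alpha)h$.

1b. $h = T(\{\alpha(1)^{-1}, \dots, \alpha(n)^{-1}\})T(\alpha)h$.

2. $T(\beta^*)h = \frac{1}{n} h$.

3. If $h(r) \ge g(r)$ for r between r' and r''; then $T(\alpha)h(r) \ge T(\alpha)g(r)$
for r between $\{r'_i \alpha(i) / \sum_j r'_j \alpha(j)\}$ and $\{r''_i \alpha(i) / \sum_j r''_j \alpha(j)\}$.

4. If h is linear, then so is $T(\alpha)h$.

5. If h is convex (strictly convex), then so is $T(\alpha)h$.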
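Exercise 5 is obvious in the light of Exercises 3 and 4, but some may
prefer the demonstration suggested by the following calculation, where
$\lambda + \mu = 1$; $\lambda, \mu \ge 0$; and obvious abbreviations are used.

(10)  $T(\alpha)h(\lambda r + \mu r')$
      $= h\left(\dfrac{\lambda r\alpha + \mu r'\alpha}{\lambda \sum r\alpha + \mu \sum r'\alpha}\right)(\lambda \sum r\alpha + \mu \sum r'\alpha)$
      $\le \lambda \left(\sum r\alpha\right) h\left(\dfrac{r\alpha}{\sum r\alpha}\right) + \mu \left(\sum r'\alpha\right) h\left(\dfrac{r'\alpha}{\sum r'\alpha}\right)$
      $= \lambda T(\alpha)h(r) + \mu T(\alpha)h(r')$.

It is amusing to establish once more that observation generally pays,
this time by means of (10), (4), and Exercises 5 and 2.

(11)  $n E(T(\beta)k(r) \mid m) \ge n T(\beta)k(E(r \mid m))$
      $= n T(\beta)k(\beta^*)$
      $= k(\beta)$.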
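The inequality that observation generally pays, $nE(T(\beta)k(r) \mid m) \ge k(\beta)$, can be verified numerically for a particular m. The following sketch is a hypothetical illustration rather than anything given in the text: it takes k to be the value of a finite set of acts, $k(\beta) = \max_f \sum_i \beta(i) u_f(i)$, and uses an invented twofold example in exact arithmetic.

```python
from fractions import Fraction


def T(alpha, h):
    """Return the function T(alpha)h of definition (8)."""
    def transformed(r):
        weight = sum(ri * ai for ri, ai in zip(r, alpha))
        point = tuple(ri * ai / weight for ri, ai in zip(r, alpha))
        return h(point) * weight
    return transformed


def value_of_acts(utilities):
    """k(beta) = max over acts of sum_i beta(i) * utilities[act][i]."""
    def k(beta):
        return max(sum(b * u for b, u in zip(beta, row)) for row in utilities)
    return k


def value_with_observation(m, beta, k):
    """n * E(T(beta)k(r) | m), the right-hand side of (9)."""
    n = len(beta)
    Tk = T(beta, k)
    return n * sum(prob * Tk(r) for r, prob in m.items())


F = Fraction
# Hypothetical standard form m for a twofold partition (it satisfies (4)).
m = {(F(2, 3), F(1, 3)): F(9, 16), (F(2, 7), F(5, 7)): F(7, 16)}
beta = (F(1, 2), F(1, 2))
k = value_of_acts([[1, 0], [0, 1]])  # utility 1 for naming the true B_i

print(k(beta), value_with_observation(m, beta, k))
```

The sum of $r_i \alpha(i)$ is positive here because every $\alpha(i)$ is positive and the $r_i$ sum to 1, so the division in `T` is safe under the conditions (4).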

If x and x’ are observations and m and m’ are the corresponding
distributions, it is now easy to say in terms of m and m’ when x is utterly
irrelevant, when it is definitive, and when x is virtually an extension of x’.


More exercises

6. The observation x is utterly irrelevant if and only if
$P(r = \beta^* \mid m) = 1$.

7. The observation x is definitive, if and only if $P(r_i = 1 \mid m) = 1/n$,
or, equivalently, if and only if $P(r_i = 0 \mid m) = (n - 1)/n$.

8a. The observation x is a virtual extension of x’, if and only if, for
every convex function h defined for r,

(12)  $E(h(r) \mid m) \ge E(h(r) \mid m')$.

8b. The two observations are virtually equivalent, if and only if, for
every convex function h,

(13)  $E(h(r) \mid m) = E(h(r) \mid m')$.


The conclusion reached in Exercise 8b can be much improved. In- 
deed, it will be shown that the two observations are virtually equiva- 
lent, if and only if m and m’ are the same probability measures. This 
will be achieved if, for example, it is shown that m and m’ have the 
same moments, for it is well known that two different countably
additive probability measures confined to a bounded set of n-tuples of
numbers cannot have the same moments.† The moments in question are
expected values of monomials of the form

(14)  $g(r) = r_1^{e_1} r_2^{e_2} \cdots r_n^{e_n}$,

where the e's are non-negative integers. In general, g will not be
convex, so it cannot be concluded immediately that g has the same
expected value with respect to m and m’. If, however, a highly convex
function is added to g, then the sum will be convex and its expected
value will be the same with respect to m and m’. Since, by hypothesis,
this is also true of the convex term of the sum, it must also be true of
the not necessarily convex term. Specifically, let

(15)  $h(r) = g(r) + \lambda \sum_i r_i^2$,

where $\lambda$ is a positive number to be determined later. To test h for
convexity, let s be for the moment an arbitrary n-tuple of numbers and
$\sigma$ a real variable, and compute the second derivative of $h(r + \sigma s)$ with
respect to $\sigma$ at $\sigma = 0$,

(16)  $\left.\dfrac{d^2 h(r + \sigma s)}{d\sigma^2}\right|_{\sigma=0} = \sum_{i,j} \dfrac{\partial^2 g(r)}{\partial r_i \partial r_j} s_i s_j + 2\lambda \sum_j s_j^2$.

If each $r_i$ is between 0 and 1, the absolute values of the second
derivatives of g have a common upper bound, say $\rho$. So, if $\lambda > \rho n^2$, h is
convex in the region where each $r_i$ lies between 0 and 1 and is a fortiori
convex in the intersection of that region with the hyperplane
$\sum r_i = 1$.

† See, for example, Corollary 1.1, p. 11, of [S13]. Under our usual
simplifying assumption that x is confined to a finite number of
values, m is certainly countably additive. [...] the whole theory can be
developed mutatis mutandis [...] distribution of x is countably
additive on some suitable [...].

Now that it has been established that m and m’ represent virtually 
equivalent observations, if and only if m and m’ are identical, it is
apparent that m (or, more exactly, the set of conditional distributions
$P(r \mid B_i) = n r_i m(r)$) is a unique standard form for all observations
virtually equivalent to x.

If x virtually extends y, it is to be expected that, no matter what rea- 
sonable definition of “informative” may be suggested, x will be at least 
as informative as y. In particular, it is to be expected that the
information of $B_i$ with respect to $B_j$ (as defined in § 3.6) will be at least as
large for x as for y, which the following calculation verifies, supposing
for simplicity that, for both observations, infinite information is
impossible. The point in question depends on the convexity of the
function h defined by

(17)  $h(r) = r_i(\log r_i - \log r_j)$,

because

(18)  $I_{i,j} = E(\log r_i - \log r_j \mid B_i)$
      $= n E[r_i(\log r_i - \log r_j) \mid m]$.

The required convexity can be demonstrated much as it was in (15)
for a different function also momentarily called h:

(19)  $\left.\dfrac{d^2 h(r + \sigma s)}{d\sigma^2}\right|_{\sigma=0} = \dfrac{\partial^2 h(r)}{\partial r_i^2} s_i^2 + 2\dfrac{\partial^2 h(r)}{\partial r_i \partial r_j} s_i s_j + \dfrac{\partial^2 h(r)}{\partial r_j^2} s_j^2$
      $= \dfrac{s_i^2}{r_i} - \dfrac{2 s_i s_j}{r_j} + \dfrac{r_i s_j^2}{r_j^2}$
      $= \dfrac{1}{r_i r_j^2}(r_j s_i - r_i s_j)^2 \ge 0$.


It would be interesting to know whether every virtual extension is
realized by an actual extension, that is, whether [...] virtual
extension of y there exist random variables [...]. To the best of my
knowledge, [...] thus far [...] in the case of twofold partitions, the
demonstration for that case being given by Blackwell in [B16].
Statistics Proper


1 Introduction 


I think any professional statistician, whether or not he found himself
in sympathy with the preceding chapters, would feel that, even allowing
for the abstractness expected in a book on foundations, those chapters
do not really discuss his profession. He would not, I hope, find the
same shortcoming in this and the succeeding chapters, for they are
concerned with what seems to me to be statistics proper. The purpose of
the present short chapter is to explain this transition and to serve as a
general introduction to its successors.

2 What is statistics proper? 


So far as I can see, the feature peculiar to modern statistical activity
is its effort to combat two [...] special problems that arise from more
than one person participating in a decision. From the personalistic
point of view, statistics proper can perhaps be defined as the art of
dealing with vagueness and with interpersonal difference in decision
situations. Whether this very tentative definition is justified, later
sections and chapters will permit the statistical reader to judge. At any
rate, vagueness and interpersonal difference are the concepts that,
directly or indirectly, dominate the rest of this book.

I will not try to discuss vagueness in this chapter, but something
may profitably be said here about interpersonal differences.


3 Multipersonal problems

terms of personal probability. This is a view that can best be defended
by illustration, and the requisite illustrations will be scattered through- 
out later chapters; but some support is lent to it by those critics of 
personal probability who say that personal probability is inadequate 
because it applies only to individual people, whereas the methods of 
science are, more or less by definition, those methods that are accepta- 
ble to all rational people. 

The sort of multipersonal problems I mean to call attention to are 
those arising out of differences of taste and judgment, as opposed to 
those, so familiar in economics, arising out of conflicting interests. As a
matter of fact, the latter type of multipersonal situation can, if one 
chooses, be regarded as among the former; it may, for example, be 
said that you and I have different tastes for the process of taking a
dollar from me and giving it to you.

Though modern statisticians do not at all deny the existence of dif- 
ferent tastes in different people, only occasionally do they take that 
difference explicitly into account. In particular, the theory of utility 
has scarcely ever entered explicitly into the works of statisticians. Our 
intellectual ancestors who believed in the principles of mathematical 
expectation were less tolerant than modern statisticians in so far as 
they denied rationality in those whose tastes departed from that prin- 
ciple, and some of their bigotry is occasionally met with today. 

In dealing with multipersonal situations, it is clearly valuable to 
recognize those in which the people involved may all reasonably be 
expected to have the same tastes, that is, utilities, with respect to the 
alternatives involved in the situation. Explicit attempts to discover 
general circumstances under which people’s tastes will be identical are 
rare. The most important and fruitful attempt of this sort is
represented by D. Bernoulli’s idea that utility functions will [...] be
approximately linear within sufficiently confined [...].
Consciously or unconsciously, that principle is repeatedly [...]
throughout statistics; it was, for example, brought out in § [...] that the
very idea of an observation depends for its practical value on Bernoulli’s
principle of approximate linearity.

Relatively [...] exploitations of similarity of taste [...] made in
statistics. The idea is often expressed, [...] penalty for making an
estimate discrepant from the number to be estimated will, for everyone
concerned, be proportional (within a reasonable range) to the square of
the discrepancy; an argument for this principle as a rule of thumb
appropriate to many [...] § 15.5. Again, there are situations in which it
is agreed that the penalty will depend only on the discrepancy and not
on the true value of the number to be estimated. Of course, there are
problems in which both rules are invoked simultaneously, the penalty
being supposed to be proportional to the square of the discrepancy and
independent of the value to be estimated.

Turn now to differences in judgment, that is, to differences in the
personal probability, for different people, of the same event. Though
modern objectivistic statisticians may recognize the existence of
differences of judgment, they argue in theoretical discussions that
statistics must be pursued without reference to the existence of those
differences, indeed without reference to judgment at all, in order that
conclusions shall have scientific, or general, validity. To put the same
idea in personalistic terms, I would say that statistics is largely devoted
to exploiting similarities in the judgments of certain classes of people
and in seeking devices, notably relevant observation, that tend to
minimize their differences.

The tendency of observation to bring about agreement has been
illustrated in § 3.6. Some of the other general circumstances in which
different people may be expected to agree, or at least nearly agree, in
some of their judgments have also been mentioned. For example, it
may well happen that different people are faced with partition
problems that are the same in that the same variable is to be observed by
each person, but differ in that each person has his own a priori
probabilities $\beta$ and his own set of available acts F. If, however, the
conditional distribution of x given $B_i$ is the same for each person, then the
people will, for example, agree as to whether a contraction y of x is
sufficient, which is often of great practical value. Again, there are
circumstances under which each of these same people will agree that
certain derived acts are nearly optimal.

4 The minimax theory 


In recent years there has been developed a theory of decision, here 
with due precedent to be called the minimax theory, that embraces so 
much of current statistical theory that the remaining chapters can 
largely be built around it. The minimax theory was originated and
much developed by A. Wald, whose work on it is almost completely
summarized in his book [W3]. Wald’s minimax theory, of course,
derives from, and reflects, the body of statistical theory that had been
developed by others, particularly the ideas associated with the names of
J. Neyman and E. S. Pearson. It seems likely that, in the development
of the minimax theory, Wald owed much to von Neumann’s treatment
of what von Neumann calls zero-sum two-person games, which, though
conceptually remote from statistics, is mathematically all but identical
with study of the minimax rule, the characteristic feature of the
minimax theory.

Wald in his publications, and even in conversation, held himself 
aloof from extramathematical questions of the foundations of statistics; 
and therefore many of the opinions expressed in later chapters on such 
points in connection with the minimax theory were neither supported 
nor opposed by him. It may fairly be said, however, that he was an 
objectivist and that his work was strongly motivated by objectivistic 
ideas. 

My policy here of holding difficulties of mathematical technique to a 
minimum by making stringent simplifying assumptions will be adhered 
to in connection with the minimax theory. A large part of Wald’s book 
[W3] is concerned with overcoming the difficulties in technique that are 
here avoided by simplifying assumptions, but that must be faced in 
many practical problems. Despite Wald’s able effort, important prob- 
lems of analytic technique still remain in connection with the minimax 
theory. It should also be appreciated that the individual mathematical 
problems raised by applications of the minimax theory are often very 
awkward, even when stringent simplifying assumptions are complied 
with; consequently much work on specific applications of the theory is 


still in progress. 


CHAPTER 9 


Introduction to 
the Minimax Theory 


1 Introduction 


This chapter explains what the minimax theory is, almost without
reference to the theory of personal probability. This course seems best,
because the theory was originated from an objectivistic point of view
and as the solution of an objectivistic problem. Moreover, a
philosophically more neutral presentation seems to result, if the ideas
of personal probability are here kept out of the foreground.

The minimax theory begins with some of the ideas with which the
theory of personal probability [...]. In particular, the notions of [...]
defined. That, incidentally, is why partition problems dominate
objectivistic statistics. The partition in question is in general infinite,
but, for mathematical simplicity, it will here be assumed to be a finite
partition $B_i$.

The objectivistic position is not in principle opposed to the concept
of utility. In particular, the minimax theory is predicated on the idea
that the consequences of those acts with which it deals are measured
numerically by a quantity the expected value of which the person 
wishes to have as large as possible, whenever (from the objectivistic 
point of view) the concept of expected value applies. It will therefore 
be doing the minimax theory little or no injustice to postulate here, as 
elsewhere, that the consequences of acts are measured in utility. 

These preliminaries disposed of, the general objectivistic decision
problem is to decide on an act f in some given F, by criteria depending
only on the conditional expectations $E(f \mid B_i)$, and therefore without
reference to the “meaningless” $P(B_i)$.

Taking any personalistic or necessary point of view literally, it is 
nonsensical to pose an objectivistic decision problem, that is, to ask 
which f of F is best for the person, without reference to the $P(B_i)$. On
the other hand, many, if not all, holders of objectivistic views, like Wald, 
find themselves logically compelled by two widely held tenets to con- 
sider such problems meaningful. First, for reasons I have alluded to in 
Chapter 2 and will soon expand upon, many theoretical statisticians 
today agree, at least tacitly, that the object, or at any rate one object, 
of statistics is to recommend wise action in the face of uncertainty—a 
point of view that Wald was particularly active in bringing to the fore. 
Second, statisticians of the British-American School, of which Wald is 
to be considered a member, are objectivists and are therefore committed 
to the view that the probabilities $P(B_i)$ are meaningless, or, at any
rate, that they cannot be legitimately used in solutions of statistical
problems.

So far as I know, Wald is the only one who has proposed any solution 
to the general objectivistic decision problem, barring minor variations. 
His proposal, which is here called the minimax theory, is rather compli- 
cated to state. In view of its complexity and the importance of this 
theory for the rest of this book, and for statistical theory generally, I 
hope the reader will have particular patience with the present chapter. 


2 The behavioralistic outlook


Prior to Wald’s formulation of what is here called the objectivistic
decision problem, the problems of statistics were [...] thought of as
problems of deciding what to say rather than what to do, though there
had already been some interest in replacing the verbalistic by the
behavioralistic outlook. The first emphasis of the behavioralistic
outlook in statistics was apparently made by J. Neyman in [...], where he
coined the term “inductive behavior” in opposition to “inductive
inference.” In the verbalistic outlook, which still dominates most
everyday statistical thought, the basic acts are supposed to be
assertions; and schemes based on observation are sought that seldom
lead to false, or at any rate grossly inaccurate, assertions.

The verbalistic outlook in statistics seems to have its origin in the 
verbalistic outlook in probability criticized in § 2.1, which in turn is 
traceable to the ancient tradition in epistemology that deductive and
inductive inference are closely analogous processes.

I, and I believe others sympathetic with Wald’s work, would analyze 
the verbalistic outlook in statistics thus: Whatever an assertion may 
be, it is an act; and deciding what to assert is an instance of deciding 
how to act. Therefore decision problems formulated in terms of acts 
are no less general than those formulated in terms of assertions. 

If, on the other hand, a sufficiently broad interpretation is put on the 
notion of assertion, perhaps every decision to adopt an act can be re- 
garded as an assertion to the effect that that act is the best available, 
in which case the difference between the verbalistic and the behavioral- 
istic outlooks is only terminological; but I do think that, even under 
such an interpretation, the behavioralistic outlook with its tendency 
to emphasize consequences offers the better terminology. 

Fallacious attempts to analyze away the difference between the ver- 
balistic and behavioralistic viewpoints are also sometimes put forward, 
especially in informal discussion. For example, it is sometimes said 
that one should act as though his best estimate of a quantity were in 
fact the quantity itself. But on that basis few of us would buy life 


insurance for next year, for we do not typically estimate the year of 
our death to be so close. Other examples are discussed by Carnap in 
Section 50 of [C1].


[...] any verbalistic [...] more behavioralistic [...].

[...] is really too brief and must be supplemented by certain remarks.
To begin with, the reader may wonder whether the verbalistic outlook
has adherents who defend it against the behavioralistic, and if so what
their arguments may be. Actually, the statistical public seems
to greet the behavioralistic outlook as a relatively new idea (how old
it may actually be is beside the point here) which as such must be
regarded with some skepticism. To the best of my knowledge, however,
only one objection against the behavioralistic outlook has been
presented. It must be discussed next.


It has been seen as an objection to the behavioralistic outlook that
the consequences of some assertions, particularly those of pure science,
are extremely subtle and difficult to appraise. As a function of the true
but unknown velocity of light, what, for example, will be the
consequences of asserting that the velocity of light is between
$2.99 \times 10^{10}$ and $3.01 \times 10^{10}$ centimeters per second? But, if some
acts do have subtle consequences, that difficulty cannot properly be met
by denying that they are acts or by ignoring their consequences. Certain
practical solutions of the difficulty are known. For example,
considerations of symmetry or continuity may, as is illustrated in
Chapters 14 and 15, make a wise decision possible even in some cases
where the explicit consequences of the available acts are beyond human
reckoning. Again, analysis sketched in the next two paragraphs tends to
show that assertions with extremely subtle consequences play a smaller
role in science and other affairs than might at first be thought.

No worker would actually publish (indeed no journal would accept) as
research the hypothetical assertion about the velocity of light
mentioned in the paragraph above. The consequences might be subtle, if
he did; but they would not be very important, for no one would take
him seriously. An actual worker would do as much as was practical
to say what observations relevant to the velocity of light he, and
perhaps others, had performed and what had been observed. To be sure,
his statement of the observations would typically be much condensed;
he would resort to sufficient statistics or other devices to put his reader
rapidly in position to act as though the reader himself had made the
observations. Assertions about the velocity of light, and countless
others of that sort, are of course published in textbooks and handbooks.
These assertions do indeed have complicated consequences, so judgment
is called for in the compilation of such books; but the seriousness of the
consequences of their assertions is limited because of the possibility of
referring to original research publications, a possibility serious
textbooks and handbooks facilitate by the inclusion of bibliographies.

On the other hand, it is obvious that many problems described
according to the verbalistic outlook as calling for decisions between
assertions really call only for decisions between much more down-to-earth
acts, such as whether to issue single- or double-edged razors to an army,
how much postage to put on a parcel, or whether to have a watch
readjusted.


It is time now to turn back to objectivistic decision problems. 


3 Mixed acts 


Speaking with pedantic strictness, it might be said that Wald does
not propose a solution for the general objectivistic decision problem,
because, before undertaking a solution, he insists that F be subject to
a certain condition. On the other hand, he argues that the condition
is typically met in practice; he might fairly have insisted that it is the
very heart of much actual statistical practice. Before discussing the
issue in detail, let me give a small but typical illustration of it.

Suppose that in a rental library I am confronted with the choice
between two detective stories, each of which looks [...] the other. At
first sight it would seem that [...] me, namely, to rent one book or
[...] there are other possibilities, [...]. In particular, I can eliminate
one [...] partition (in this example, a [...]) independent of the relative
[...]. The random variable may as well be confined at the outset to two
values corresponding to [...], and random variables [...] are equivalent
for the purpose [...] statistical practice, such [...] cautions, readily
provided [...] tables of random numbers, and other devices.

[...] general objectivistic decision problem, Wald’s point
can (except for mathematical technicalities) be formulated thus: If $f_r$
represents a finite number of elements of F, and $\phi(r)$ is a corresponding
set of non-negative numbers such that $\sum_r \phi(r) = 1$, then the person can
make the mixed act

(1)  $f = \sum_r \phi(r) f_r$,

[...] irrespective of which $B_i$ obtains [...]. Technically, the sum in (1)
should, [...] replaced by an integral with respect to a probability
measure. But such integrals become superfluous under the simplifying
assumption, which is herewith made, that there are in F a finite set of
acts $f_r$, to be called primary acts, with respect to which every act in F
can be represented in the form (1). In the rental-library example, the
two acts corresponding to the two books can be regarded as primary.

Since mixed acts are also available from the personalistic point of 
view, it may well be asked whether it is advantageous to consider them 
in connection with that point of view, and, if not, how they can be of 
advantage from one point of view but not the other. The answer to 
the first part of the question is easy. Indeed, if f is defined by (1) then
it is personalistically impossible that f should be definitely preferred to
every $f_r$, that is, that

(2)  $E(f) = \sum_r \phi(r) E(f_r) > \max_r E(f_r)$,

for a weighted mean cannot be greater than all its terms. Technical
explanation of the efficacy of mixed acts from the objectivistic point of 
view can best be presented after the whole statement of the minimax 
rule, but those at all familiar with modern statistical practice will
derive some insight from the remark that the usual preference of
statisticians for random samples represents a preference for certain mixed
acts.
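The remark that a weighted mean cannot exceed all its terms is easy to check numerically. The sketch below is illustrative only: the two primary acts, their conditional expectations, and the a priori probabilities are invented numbers loosely modeled on the rental-library example, and the function names are hypothetical.

```python
def mix(phi, conditional_expectations):
    """E(f | B_i) for the mixed act f = sum_r phi(r) f_r.

    Each row of conditional_expectations holds E(f_r | B_i) for one
    primary act f_r.
    """
    n_states = len(conditional_expectations[0])
    return [sum(p * row[i] for p, row in zip(phi, conditional_expectations))
            for i in range(n_states)]


def personal_expectation(row, prior):
    """E(f) = sum_i P(B_i) E(f | B_i) for a person with the given P(B_i)."""
    return sum(b * v for b, v in zip(prior, row))


# Hypothetical numbers: two books (primary acts), two states.
primary = [[3.0, 0.0],   # book 1: E(f_1 | B_1), E(f_1 | B_2)
           [1.0, 2.0]]   # book 2
prior = [0.4, 0.6]       # the person's own P(B_1), P(B_2)
phi = [0.5, 0.5]         # choose between the books by a fair coin

mixed_value = personal_expectation(mix(phi, primary), prior)
best_primary = max(personal_expectation(row, prior) for row in primary)
print(mixed_value, best_primary)
```

Whatever weights are tried in place of `phi`, the mixed value never exceeds the best primary value, which is the content of the impossibility of (2).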


4 Income and loss 

It is sometimes suggestive, and in conformity with some statistical
(though not quite with economic) usage, to refer to $E(f \mid B_i)$ as the
income of f when $B_i$ obtains, and, correspondingly, to use the notation
I(f; i). An important concept associated with the income is that which
I shall refer to as the loss (symbolized by L(f; i)) incurred by the act f
when $B_i$ obtains. By that I mean the difference between the income
the person could attain if he were able to act with the certain knowledge
that $B_i$ obtained and that which he will attain if he decides on f when
$B_i$ does in fact obtain. Formally,

(1)  $L(f; i) =_{Df} \max_{f'} I(f'; i) - I(f; i)$.

If the person decides on f when $B_i$ obtains, L(f; i) measures in terms of
income the error he has made. If he were himself informed of $B_i$ after
f had been chosen, which is not typically the case, L(f; i) would, so to
speak, measure his cause for regret. On that account, some have
proposed to call loss “regret,” but that term seems to me charged with
emotion and liable to lead to such misinterpretation as that the loss
necessarily becomes known to the person. On the other hand, the
term “loss” has been used by Wald in the sense of negative income,
but in contexts where loss as defined here is, of the two senses, the only
defensible one, as will be explained in § 8. I hope the sense proposed
here will not cause serious confusion.


Exercises 


1. For each i, there is at least one primary act f_r such that


(2)    I(f_r; i) = \max_{r'} I(f_{r'}; i).


Such a primary act may fairly be called correct for i.


2. L(f; i) = \sum_r \phi(r) L(f_r; i) \geq 0, equality holding if and only if f is a
mixture of acts correct for i.


3. L(f; i) = \max_r I(f_r; i) - I(f; i).

4. L(f; i) = -I(f; i), if and only if


(3)    \max_r I(f_r; i) = 0.


5 The minimax rule, and the principle of admissibility

The most characteristic feature of the minimax theory is a certain
rule of behavior, or recommendation to the person. This rule, to be
called the minimax rule, can now be formulated thus: Decide on an
act f′, such that


(1)    \max_i L(f'; i) = \min_f \max_i L(f; i),


where f and f′ are, of course, confined to F.

In words, the minimax rule recommends the choice of such an act,
that the greatest loss that can possibly accrue to it shall be as small as
possible. An f satisfying the recommendation of the minimax rule will
be called a minimax act, and the maximum loss of a minimax act will
be called the minimax [...] of the problem and written L*. Under [...]
been made, it is not technically [...] that a minimax act exists.
[...] “min max” in [...] thus abbreviated, [...]

It may well happen that F contains more than one act that is minimax
for the problem, in which case the minimax rule recommends, not
a particular act, but only that the choice be narrowed to the set of 
minimax acts. Some other criterion must then be invoked to narrow 
the choice further. In particular, it can be shown that at least one of 
the minimax acts is admissible, in the sense of § 6.4. As Wald indicates, 
it would, therefore, be an inexcusable violation of the sure-thing prin- 
ciple not to narrow the choice to admissible acts. This application of 
the sure-thing principle will be called the principle of admissibility. 
The minimax rule and the principle of admissibility constitute the sub- 
ject matter of, and thereby define, the minimax theory. 


6 Illustrations of the minimax rule 


It would be hard to imagine an objectivistic decision problem simpler 
than that of whether to make an even-money (or more accurately, even- 
utility) bet in favor of a certain event or to refrain from betting. That 
problem, therefore, provides a convenient first example of the minimax 
rule and the concepts associated with it. Supposing, as one may without
loss of generality, that the bet is for one utile, the objectivistic decision
problem is completely described by Table 1, which gives the income
of each of the two primary acts for each of the two elements of
the partition corresponding to the event in question and its complement.


TABLE 1. THE INCOME OF AN EVEN-MONEY BET, I(f_r; i)

                         Event
Act                   B_1     B_2
Bet, f_1                1      -1
Don’t bet, f_2          0       0


In view of Exercises 4.2 and 4.3 the corresponding loss function is
described by Table 2. Therefore,


(1)    \max_i L(f; i) = \max_i \sum_r \phi(r) L(f_r; i)
                      = \max_i \phi(i) \geq 1/2,


equality obtaining if and only if \phi(1) = \phi(2) = 1/2. Therefore, L* = 1/2,
and the only minimax act is f = (1/2) f_1 + (1/2) f_2.


TABLE 2. THE LOSS OF AN EVEN-MONEY BET, L(f_r; i)

                         Event
Act                   B_1     B_2
f_1                     0       1
f_2                     1       0


In this problem, therefore, the minimax rule recommends that the 
person decide, in effect, by flipping a fair coin. If the odds in the bet 
had not been even, the minimax rule would have recommended the 
use of a coin with a certain bias; this more general example will be 
worked out in detail in § 12.4. It is noteworthy in connection with the 
present problem—for it happens in many others—that, for the minimax 
act f, L(f; i) = L* for every value of i.
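
The passage from the income table to the loss table, and from there to the minimax mixture, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `income`, `loss_table`, and `mixed_loss` are hypothetical, and a grid search over mixtures stands in for the exact argument given in the text.

```python
from fractions import Fraction

# Income I(f_r; i) of the even-money bet; columns are the events B_1, B_2.
income = {
    "bet": [Fraction(1), Fraction(-1)],
    "dont_bet": [Fraction(0), Fraction(0)],
}

def loss_table(income):
    """L(f_r; i) = max over primary acts of I(.; i), minus I(f_r; i) (Exercise 4.3)."""
    n_events = len(next(iter(income.values())))
    best = [max(row[i] for row in income.values()) for i in range(n_events)]
    return {act: [best[i] - row[i] for i in range(n_events)]
            for act, row in income.items()}

def mixed_loss(loss, weights):
    """Loss of a mixed act: the weighted mean of the primary losses (Exercise 4.2)."""
    n_events = len(next(iter(loss.values())))
    return [sum(weights[a] * loss[a][i] for a in loss) for i in range(n_events)]

loss = loss_table(income)

# Try the mixtures phi(bet) = k/100 and keep the one with the smallest maximum loss.
grid = [Fraction(k, 100) for k in range(101)]
best_p = min(grid, key=lambda p: max(mixed_loss(loss, {"bet": p, "dont_bet": 1 - p})))
minimax_loss = max(mixed_loss(loss, {"bet": best_p, "dont_bet": 1 - best_p}))

print(loss)
print(best_p, minimax_loss)
```

The grid is a convenience, not part of the theory; for a problem with two primary acts the same answer follows from equating the two expected losses.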

The following more elaborate example, illustrating the mechanism of
observation, is paraphrased from a slightly incorrect example in [$2].
Of three numbered coins, two are pennies and one is a dime, or else one
is a penny and two are dimes. This gives rise to a sixfold partition B_i,
because any of the three coins may be the singular one, and in two ways.
The available primary acts are described in two stages thus: First, the
person may select one of the coins by number for observation, or he
may refrain from so doing; second, he must guess at the denomination
of the singular coin. His income in utiles is defined by the following
conditions:

1. If the singular coin is a penny, he must pay a tax of 10; if it is a
dime, he receives a bonus of 20.

2. If he chooses to observe a coin, he must pay an inspection fee of
1, regardless of the particular coin selected for observation.

3. If his guess is incorrect he pays a penalty of 8.

It is easy to see that the first of the three terms in the person’s income
is irrelevant to his loss, since his decision does not affect the magnitude
of that term. His loss is therefore the sum of two terms. The
first of these is 1 or 0 depending on whether he decides to make an
observation; the second is 0 or 8, depending on whether his guess is correct.
If the person chooses not to pay the inspection fee, it is clear from the
preceding example that, no matter what he does, his loss may be as 
high as 4, and that it is certain to be that small if and only if he governs 
his guess (essentially) by the flip of a fair coin. 

Suppose next that the person decides to make an observation. If 
he selects any particular coin for observation, he is as badly off as he 
was before the observation, and he has in addition incurred the inspec- 
tion fee. Thus, even if the person knows that the first coin is a penny, 
there is nothing he can do to be sure that his total loss will not be more 
than 5, and, as before, he can guarantee that small a loss only by govern- 
ing his guess with the flip of a fair coin. 

I think every practicing statistician would say that, if an observation 
is to be made at all, one of the three coins should be selected at random 
(i.e., the probability 1/3 should be attached to observing each of them) 
and after the observation the person should guess that the singular 
coin is opposite in denomination to the one observed. It will be shown 
in the next paragraph that this common-sense act is minimax. 

In the first place, the loss L(f_0; i) for the act f_0 in question is, for each
i, equal to 1 + (1/3) × 8 = 3 2/3, which is less than 4; for the inspection fee
is 1 and the probability of making a wrong guess, which would result
in the loss of 8, is 1/3. To show that f_0 is minimax, it will be enough to
show that every act can result in a loss of at least 3 2/3. One possibility
for doing this (which in § 12.3 will be shown to be a natural one to try)
is to show that, for a certain set of weights, the weighted average of
L(f; i) with respect to i is at least 3 2/3 for all f. In fact, it is sufficient,
in view of Exercise 4.2, to establish such an inequality for the primary
acts. In the present example, it happens that the weights can be chosen
to be equal. What is to be shown, then, is that the following inequality
obtains for every primary f:


(2)    L(f) =_{Df} \frac{1}{6} \sum_i L(f; i) \geq 3\tfrac{2}{3}.


Now, if the primary act f does not involve observation, L(f) = 4; because
three of the six terms to be averaged are then 8, and the other
three are 0. Suppose next, for definiteness, that f involves the observation
of the first coin; there are then three possibilities to consider.
First, the guess is made without regard for the denomination observed,
in which case the observation is, so to speak, thrown away, making
L(f) = 5. Second, the denomination guessed may be the same as the
denomination observed, in which case the guess will be wrong for four
of the six values of i, making L(f) = 6 1/3. Finally, the denomination
guessed may be the opposite of the one observed, in which case the guess
will be wrong for two of the six values of i, making L(f) = 3 2/3. This
argument shows that L* ≥ 3 2/3; and, since L(f_0; i) = 3 2/3 for every i, f_0
is a minimax act and L* = 3 2/3. It would not be difficult to show that
f_0 is the only minimax act for this problem.
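
The arithmetic of the three-coin example is easily checked by brute force. The following is a minimal sketch, assuming the reading of the example given above (six equally weighted states, an inspection fee of 1, a penalty of 8); the names `STATES`, `RULES`, `loss`, and `f0_loss` are hypothetical.

```python
from fractions import Fraction
from itertools import product

DENOMS = ("penny", "dime")
# A state is (index of the singular coin, denomination of the singular coin).
STATES = list(product(range(3), DENOMS))

def other(denom):
    return "dime" if denom == "penny" else "penny"

def observed(state, coin):
    """Denomination seen when the given coin is inspected in the given state."""
    singular, denom = state
    return denom if coin == singular else other(denom)

# Guessing rules: map the observed denomination to a guess about the singular coin.
RULES = {
    "always_penny": lambda seen: "penny",
    "always_dime": lambda seen: "dime",
    "same": lambda seen: seen,
    "opposite": lambda seen: other(seen),
}

def loss(act, state):
    """Inspection fee (1 if a coin is observed) plus a penalty of 8 for a wrong guess."""
    coin, rule = act
    seen = None if coin is None else observed(state, coin)
    guess = RULES[rule](seen)
    return (0 if coin is None else 1) + (0 if guess == state[1] else 8)

# Primary acts: guess without observing, or observe one coin and apply a rule.
primary = [(None, "always_penny"), (None, "always_dime")]
primary += [(coin, rule) for coin in range(3) for rule in RULES]

# Equal-weight average of L(f; i) over the six states, for each primary act.
average = {act: Fraction(sum(loss(act, s) for s in STATES), 6) for act in primary}

# f_0: observe a coin chosen with probability 1/3 each, then guess the opposite.
f0_loss = {s: sum(Fraction(1, 3) * loss((coin, "opposite"), s) for coin in range(3))
           for s in STATES}

print(min(average.values()), set(f0_loss.values()))
```

The smallest equal-weight average over the primary acts and the loss of f_0 in every state should coincide, which is the substance of the argument that f_0 is minimax.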


7 Objectivistic motivation of the minimax rule 


The minimax rule recommends an act for the person to choose; more 
strictly, it recommends a sharp narrowing of his choice. But how can 
this particular recommendation be motivated? To the best of my 
knowledge no objectivistic motivation of the minimax rule has ever 
been published. In particular, Wald in his works always frankly put 
the rule forward without any motivation, saying simply that it might 
appeal to some. Though my heart is no longer in the objectivistic point
of view, I will in the next few paragraphs suggest a relatively objectivistic
motivation of the rule.


On the other hand, there are pra[...] might well be willing to accept
the rule [...] holds a personalistic view of probab[...] necessity. [...]
is quite small [...] merit serious [...] sense seem
nearly incredible. Suppose, for example, that I were faced with such
a decision problem, in which it may be assumed for simplicity that there 
is only one minimax act f, and consider how I might defend the choice 
of that act to someone who proposed another to me. He might, for 
example, tell me that he knows from long experience, or by a tip from 
his broker, that some act g is preferable to f. “Well,” I might say, “I 
have all the respect in the world for you and your sources of informa- 
tion, but you can see for yourself—for it is objectively so—that the 
most I can lose if I adopt f is L*.” He will not be able to say the same 
for g, and in many actual situations the greatest possible loss under g 
may be many times as great as L* and of such a magnitude as to make 
a serious difference to me should it occur, which may well end the argu- 
ment so far as I am concerned. 

It is of interest, however, to imagine that my challenger presses me 
more closely, reminding me that I am a believer in personal probability, 
and that in fact I myself attach an expected loss L to g that is several 
times smaller than L*. Even then, depending on the circumstances, I 
might answer frankly that in practice the theory of personal probability 
is supposed to be an idealization of one’s own standards of behavior; 
that the idealization is often imperfect in such a way that an aura of 
vagueness is attached to many judgments of personal probability; that
indeed in the present situation I do not feel I know my own mind well
enough to act definitely on the idea that the expected loss for g really
is L; but that I do, of course, feel perfectly confident that f cannot re- 
sult in a loss greater than L*, a prospect that in the case at hand does 
not distress me much. 

It seems to me that any motivation of the minimax principle, ob- 
jectivistic or personalistic, depends on the idea that decision problems 
with relatively small values of L* often occur in practice. The mecha- 
nism responsible for this is the possibility of observation. The cost of 
a particular observation typically does not depend at all on the uses to 
which it is to be put, so when large issues are at stake an act incorporat- 
ing a relatively cheap observation may sometimes have a relatively 
small maximum loss. In particular, the income, so to speak, from an 
important scientific observation may accrue copiously to all mankind
generation after generation.


8 Loss as opposed to negative income in the minimax rule

As a variant to the minimax rule as I have stated (or perhaps I should 
say interpreted) it, one might consider the possibility of letting the 
negative of income play the role of the loss in (5.1). Indeed, strictly 
speaking, Wald himself always proposed the minimax rule in that 
form. I believe he never made written allusion to the rule formulated
in terms of loss (as “loss” is defined here); orally he took the position 
that loss and the form of the minimax rule based on it were inventions 
of mine, toward which he was tentatively sympathetic. There is vir- 
tually no mathematical difference between the two rules, and it was 
characteristic of Wald’s approach to the foundations of statistics to be 
reluctant to commit himself with respect to any other differences. 
Though the minimax rule founded on the negative of income seems 
altogether untenable, as will soon be explained, and though no one but 
myself seems to question that I originated the variant of the theory 
based on loss, little or no originality is attributable to me in this re- 
spect. Wald more than foreshadowed the idea, for, though he based 
his minimax rule on the negative of income, he made it clear in publica- 
tions, including [W3], that he regarded as typical problems in which 
the income has, for every i, the property specified in Exercise 4.4. 
Therefore, in the situations Wald regarded as typical, the distinction
between the two forms of the rule vanishes, so, until hearing his explicit
disavowal, I considered the idea of loss as opposed to negative
income his.

To see that the minimax rule founded on the negative of income is
utterly untenable for statistics, consider, for example, a twofold partition
problem with two primary acts in which the income is as in Table 1.


TABLE 1. I(f_r; i)

                 Event
Act           B_1     B_2
f_1            -1      -1
f_2           -10       1


[...] be at most 1, whichever B_i [...] to B_i [...]me will be at least 1
[...] he again has no recourse but
to decide on f_1. In short, for the problem at hand, the person’s behavior
would not be influenced by any observation, however relevant. This 
seems to me absurd on the face of it, but perhaps the absurdity can be 
brought out by a less abstract situation paralleling the example just 
given. A person has a ladder, and, just as he is about to use it, it oc- 
curs to him that the ladder may possibly be dangerously defective. 
He envisages two basic primary acts: f_1, to throw the ladder away and
buy a new one, which will cost 1 utile in either event; and f_2, to use the
ladder, which will, if the ladder is defective, result in his injury to the 
extent of 10 utiles, and will, if the ladder is sound, accomplish his ob- 
ject, which is worth 1 utile. Now, if the person acts on the principle of 
minimizing the maximum of negative income, he will throw the ladder 
away, no matter what tests tend to show that it is sound. 
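
The contrast between the two forms of the rule can be illustrated on the ladder table itself, before any observation is brought in. The sketch below assumes the incomes -1, -1 for discarding the ladder and -10, 1 for using it, as described in the preceding paragraph; the names `income`, `neg_income`, `loss`, and `minimax_mixture` are hypothetical, and the grid denominator 110 is chosen only so that the grid happens to contain the equalizing mixture.

```python
from fractions import Fraction

# Income I(f_r; i); columns are (ladder defective, ladder sound).
income = {"f1_replace": (-1, -1), "f2_use": (-10, 1)}

# Loss as defined in section 4: best attainable income for each event, minus income.
best = [max(col) for col in zip(*income.values())]
loss = {act: tuple(b - x for b, x in zip(best, row)) for act, row in income.items()}

# The variant of the rule discussed in this section uses negative income instead.
neg_income = {act: tuple(-x for x in row) for act, row in income.items()}

def minimax_mixture(table, steps=110):
    """Grid search for the weight on f1_replace that minimizes the maximum entry."""
    def worst(p):
        return max(p * a + (1 - p) * b
                   for a, b in zip(table["f1_replace"], table["f2_use"]))
    grid = [Fraction(k, steps) for k in range(steps + 1)]
    p = min(grid, key=worst)
    return p, worst(p)

print(minimax_mixture(neg_income))  # (weight on f1_replace, guaranteed bound)
print(minimax_mixture(loss))
```

Under negative income the search puts all weight on discarding the ladder, whereas under loss it settles on a genuine mixture; the sketch does not model the effect of tests on the ladder, which is the point the text goes on to make.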


CHAPTER 10


A Personalistic Reinterpretation
of the Minimax Theory


1 Introduction


In this chapter a reinterp[...] the theory of personal probability
and the idea that statistical problems are typically multipersonal,
is tentatively put forward. The reinter[...] here. In particular,
the liberty is taken [...]ings in order to bring out the paralleli[...]tions.


2 A model of group decision


[...] These people are [...] the consequences [...]; that their jud[...]
as to questions of fact may differ; that, to put it technically, they may
have different systems of personal probability. Still other situations 
resembling the group decision problem are widespread in science and 
industry, though the group decision problem does by no means repre- 
sent the only sort of social interaction tending to make the theory of 
personal probability, confined to a single person, inadequate. When- 
ever a hospital or a factory modifies its procedures, whenever a doctrine 
is adopted with little reservation by virtually all the workers in a 
science, or whenever a panel of experts drafts a report, something like 
group decision is taking place. 

Since the members of the group in a group decision problem, though 
required to act in concert, typically differ from one another in their 
probability judgments, it is too much to expect that any rule can be 
formulated that will be acceptable to, or in any sound sense proper for, 
all groups under all circumstances. On the other hand, there may be 
one or more rules of thumb that will lead the group to an acceptable 
compromise in many practical circumstances. Two such suggestions, 
the group minimax rule and the group principle of admissibility, will 
be made and explored in the next section. 


3 The group minimax rule, and the group principle of admissibility 

In the first place, the possibility of using mixed acts is to be pointed 
out. If, for example, you and I, walking together, disagree about which 
branch of a fork in the road leads home, we can, and in fact may, de- 
cide which to try by flipping a coin. 

In general, mixed acts are available in a group decision problem for
reasons analogous to their availability in objectivistic decision problems,
for, though the members of a group may generally differ in the
probabilities they personally assign to some events, there is in practice
an abundance of events associated with coins, cards, random numbers,
and the like that make it possible for the group to mix the primary acts
in any proportion, all members of the group being in agreement about
what the proportions are. The example of the fork in the road illustrates
how the use of mixed acts can effect such a compromise as to
make decision possible in what might otherwise be an impasse. As in
the account of the objectivistic decision problems, it will therefore be
taken for granted from now on that F contains all mixtures of its elements,
and once more, for mathematical simplicity, it will be assumed
that there are a finite number of primary acts f_r in F, of which all
others are mixtures.

The ith person in the group attaches a certain expected utility, or
(personal) income, to the act f; call it I(f; i). In the judgment of the
ith person, adoption of the act f would represent a (personal) loss,


(1)    L(f; i) = \max_{f'} I(f'; i) - I(f; i),


(possibly zero) as compared with the income or expected utility that
in his opinion would result from an act he considers most promising.

The group minimax rule is the suggestion that an act be adopted
such that the largest loss faced by any member of the group will be as
small as possible. To put it formally, the suggestion is that an f′ be
adopted such that


(2)    \max_i L(f'; i) = L* =_{Df} \min_f \max_i L(f; i).


The parallelism between the [...] stated in § 9.5 is great. In particular,
(2) is identical in appearance [...] though a fruitful one, because [...]
meanings in the two contexts.

[...] is small, in a rather vague [...]max rule. Indeed, if L* is [...]
member of the group to face [...]e suggestion is a serious mis[...]up
can suggest an alternative [...] great as L*, for there is none. [...]
a circumstance which, when it occurs, [...] the suggestion by making
it seem fair.

Of course it is possible that, as in the objectivistic interpretation,
more than one act fulfilling the minimax principle exists. Here, a paraphrase
of the principle of admissibility will further narrow the choice,
for if


(3)    L(g; i) \leq L(f; i)


for every i, with inequality obtaining for some i, the group cannot seriously
consider f.


4 Critique of the group minimax rule 


Some of the criticisms that have been, or may be, raised against the
minimax rule can as well be discussed in connection with one interpretation
as with the other, and Chapter 13 will be devoted to such criticisms.
But some that bear specifically on the multipersonal interpretation
in this chapter should be discussed here.

In the first place, the group minimax rule is flagrantly undemocratic. 
In particular, the influence of an opinion, under the group minimax rule, 
is altogether independent of how many people in the group hold that 
opinion. In general, it is difficult to give a formal analysis of the concept 
of democratic decision, a point discussed at length by Arrow [A5], Hil- 
dreth [H4a], and others. Perhaps, considering that the people in the 
group are postulated to have a common utility function, a satisfactory 
analysis of democratic decisions could be given in the case of a group 
decision problem by some such procedure as minimizing the average 
with respect to i of L(f; i). But, in many situations in which I envisage
application of the group minimax principle, the group will in fact be a 
rather nebulous body of people, for example the group of all specialists 
in some field. The principle would in such a case be administered by a 
single member of the group somewhat in the following fashion. In 
planning an investigation, the results of which he intends to publish, 
he will endeavor to take account of all opinions, so far as he can know 
or guess them, that are considered at all reasonable in his field of in- 
vestigation. And when he publishes his conclusions he will say, in 
effect, “Whatever reasonable opinions have heretofore been held by 
members of this specialty, in the light of my investigation and the min- 
imax rule, it is now proper for the members of the specialty, in so far 
as they are called upon to act in concert, to agree to such and such an 
action.” To put it a little differently, in such an application the group 
is rather fictitious, and the individual investigator is admitting as rea- 
sonable a rather large class of opinions, but excluding many that he 
is sure his confreres will agree are utterly absurd. He will, for example, 
feel quite free to exclude those opinions that almost all educated people 
regard as superstitious. 

The group minimax rule is also objectionable in some contexts, because,
if one were to try to apply it in a real situation, the members of
the group might well lie about their true probability judgments, in
order to influence the decision generated by the minimax rule in the
direction each considers correct. This objection is, however, scarcely
serious in the fictitious sort of application suggested above.

It is appropriate, in terminating this section, to discuss a certain distinction,
neglect of which can, as was pointed out to me orally by Bruno
de Finetti, lead to serious misunderstanding of the group minimax rule.
Voluminous observation typically tends to make any one person almost
certain of the truth, and also, when a group of people is involved, it
typically tends to make L* small. These two tendencies, though related,
are separate phenomena, as an illustration will bring out.

Suppose that Peter and Paul are required to bet 1 utile in concert 
either that the majority of a large electorate has voted for, or that it 
has voted against, a certain issue; but that before betting they are to 
be allowed to examine a random sample of 1,001 ballots. 

If specific opinions about the division of the electorate are assigned
to Peter and Paul, the situation can be regarded as a group decision
problem. To start with an interesting extreme possibility, suppose
that it is Peter’s unequivocal opinion that 55% of the electorate is for
and 45% is against the issue and Paul’s that the division is 45% for
and 55% against; that is, Peter, for example, is supposed to act as
though he knows that the division is 55%-45%.

If, finally, it is understood that [...] in the two people, Peter and Paul,
deciding, before the sample is actually observed, how their bet is to [...]
fluctuation the sample will corroborate his “knowledge” that the majority
is for the issue. Numerically, L* is about 0.0008.

[...] possibility of observing the [...] 0.0008 as compared with the [...]
no sample were available, may well find the min[...]y rate, it is hard
to see in [...] other.

[...] one of the two people is immovably wrong, and the observation of no
sample, however large, can bring them both close to the truth. This
brings out a contrast between the reduction of L* and the approach to
certainty of the truth, both of which typically occur with the accumulation
of evidence.
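
The order of magnitude quoted for L* can be checked against a binomial tail. The sketch below computes the probability that a sample of 1,001 ballots shows a majority against the issue when 55% of the electorate is in fact for it, assuming binomial sampling from a large electorate; the function name `binom_cdf` is hypothetical. How this tail probability is scaled into L* depends on how the loss from a wrong bet is measured, which the damaged passage does not let one confirm, so only the order of magnitude should be compared.

```python
import math

def binom_cdf(k_max, n, p):
    """P(X <= k_max) for X ~ Binomial(n, p), with each term formed in log space."""
    total = 0.0
    for k in range(k_max + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log1p(-p))
        total += math.exp(log_term)  # far-tail terms underflow harmlessly to 0.0
    return total

# At most 500 of 1,001 sampled ballots are for the issue although 55% of voters are.
p_contradicted = binom_cdf(500, 1001, 0.55)
print(p_contradicted)
```

By the symmetry of the example, the same figure applies to Paul with the roles of the two divisions interchanged.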
The wary will ask, “Who will feel how, when the actual majority is
disclosed and settlement made? What if Peter’s unequivocal opinion 
turns out to be false?” Such questions suggest that paradox lurks in 
an example in which different people unequivocally hold mutually in- 
consistent opinions, so there is some interest in considering a modifica- 
tion of the example, free of that objectionable feature. 

Suppose then that Peter and Paul, though strongly opinionated about 
the division of the electorate, are not absolutely unequivocal in their 
opinions. To be quite definite, suppose that Peter attaches probability 
1 - 10^{-10} to the division 55%-45% and probability 10^{-10} to the division
45%-55%, and that Paul attaches the same probabilities but in
the opposite order to the two divisions. Here, as in the example of the 
unequivocal opinions, the unique minimax act is to let the bet be chosen 
in accordance with the sample majority; L* is a trifle lower than before. 
Observation of the sample does now generally affect the opinions of the 
two people, but, though it radically reduces the minimax loss, it does 
not typically bring the two people into close agreement. If, for ex- 
ample, the division is in fact 45%-55%, Paul’s strong a priori belief 
that that is the actual division is almost sure to be strengthened by the 
sample, and Peter’s equally strong but false belief is almost sure to be 
weakened. Still, the probability is only about 1/2 that Peter will be 
led by the sample to attach an a posteriori probability even as great 
as 0.05 to the actual division. Thus, speaking loosely, but I think prac- 
tically, the approach to certainty of the truth is here not typically 
nearly so far advanced by observation as is the reduction of the minimax
loss.
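
The figure of about 1/2 can be reproduced by a direct application of Bayes' rule. The following is a minimal sketch, taking Peter's prior probability for the division 45%-55% to be 10^{-10} (my reading of the passage), supposing that division to be the actual one, and assuming binomial sampling; the names `binom_pmf` and `posterior_actual` are hypothetical.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials, formed in log space."""
    log_term = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_term)

def posterior_actual(x_for, n=1001, prior=1e-10):
    """Peter's posterior probability of 45%-55% after x_for of n ballots are for."""
    # Likelihood ratio of 45%-55% against 55%-45%; the binomial coefficients cancel.
    log_lr = (n - 2 * x_for) * math.log(0.55 / 0.45)
    odds = prior / (1 - prior) * math.exp(log_lr)
    return odds / (1 + odds)

N = 1001
# Probability, under the actual division 45%-55%, that the sample leads Peter
# to a posterior probability of at least 0.05 for that division.
prob = sum(binom_pmf(x, N, 0.45) for x in range(N + 1)
           if posterior_actual(x, N) >= 0.05)
print(prob)
```

The posterior crosses 0.05 very near the expected number of ballots for the issue, which is why the probability comes out close to one half.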

It may not be superfluous to point out that the preceding paragraph
alludes not only to the two different personal probability systems of
Peter and of Paul, but also to certain conditional probabilities that
you and I have accepted hypothetically in setting up the example.

Whichever division does actually obtain, it is rather probable that,
once the sample is observed, either Peter or Paul will wish he could
break his contract. This seems to me to reflect a serious objection to
the group minimax principle, especially in those applications in which
the members of the group are not literally consulted, for people cannot
be expected to abide by disappointing contracts they might have made
but didn’t.

For other approaches to the group decision problem see de Finetti
[D6] and [D7a].


CHAPTER 11


The Parallelism between 
the Minimax Theory and 
the Theory of Two-Person Games 


1 Introduction 


John von Neumann, in 1928 [V3], developed a theory of games in
which two people play each other for money. This theory is mathematically
so closely akin to that of the minimax rule and has had such
influence on its development that it would be artificial to give an exposition
of the minimax rule without saying something of the theory of
what von Neumann calls zero-sum two-person games, though the account
given here must necessarily be highly compressed. The most
convenient references in English to the [...] games, should the reader
be interested i[...] [M3], and Chapters II and II[...]


2 Standard games 


A certain sort of game, [...] You secretly choose a number r [...]
numbers r and i having [...] (possibly negative) L(r; i), w[...]
known to both of us. It is [...] of us finds money proportio[...]
[...]nction of r and i, [...]ms involved, each [...]


† In this completely independent development he was to some extent anticipated
by Emil Borel. Consult [F9], [F10], and [B21] for details and further references.

At first sight, standard games look very dull, though it is immediately
recognized that some such games are played. A tiny but typical ex- 
ample is the game of “Button, button, who’s got the button?” ; “Stone, 
paper, scissors” is almost as familiar an example; and others could be 
mentioned. But, and this seems remarkable at first, any game, except 
possibly those dependent on physical skill, can be viewed as a standard 
game. The great generality of standard games is demonstrated in de- 
tail in Chapter II of [V4], but informal discussion of a single example 
will render the idea intuitively clear. Suppose then that you and I are 
to play a game of poker (of a specified variety). At first sight poker 
does not seem to be a standard game, because it involves several ran- 
dom events, and several decisions on the part of each of us, some to be 
made in the light of others. But, it can be argued, there are only a 
finite number of different situations that can arise in the course of a 
game of poker. You could, therefore, in principle write into a notebook 
exactly which choice you would make in each of the possible situations 
with which you might be faced in playing poker with me. The number 
of possible ways of compiling such notebooks, or policies of play, is 
finite; so, except for limitations of time and patience, you will be at 
no disadvantage in playing one game with me, if you simply choose
once and for all that one of the many possible policies of play that seems 
best to you. Similarly, from my point of view, the game consists, in 
principle, in choosing one policy of play. Once you have chosen one 
of the policies possible for you, say the rth, and I have chosen one of 
the policies possible for me, say the ith, the amount you will have to 
pay me at the termination of the game is a random variable. Since it 
is agreed that the payments are effectively in utiles for both of us, your 
payment to me is effectively the expected value of this random variable, 
which may be called L(r; i) and which is in principle known to both 
of us as a function of r and i. The elaborate game of two-person poker 
is thus exhibited, at some expense to realism, as a standard game. 

Regarding the choice of an r by you or an i by me as a primary act, 
both of us are at liberty to use mixed acts. Indeed, explicit attention 
apparently was first called to the possibility of using mixed acts by 
Borel (see [B21]), in just this context. 

Let f and g represent mixed acts assigning probabilities \phi(r) and \gamma(i)
to the values r and i, respectively. The standard game is now replaced
by a somewhat different game in which you choose an f; I choose a g;
and you pay me the amount L(f; g), where


(1)    L(f; g) =_{Df} \sum_{r, i} L(r; i) \phi(r) \gamma(i).


3 Minimax play


Von Neumann adduces an argument, the statement of which will be 
briefly postponed, that, if you have respect for my intelligence, you will 
see to it that the most I can possibly take from you shall be as small 
as possible, that is, you will choose an f’ for which 


(1)    \max_g L(f'; g) = L* =_{Df} \min_f \max_g L(f; g).


Symmetrically, according to his argument, I should choose a g’ such 
that 


(2)    \min_f L(f; g') = L_* =_{Df} \max_g \min_f L(f; g).


Since, making the recommended choice, you are sure that you will
not pay me more than L*, and I am correspondingly sure that you will
not pay me less than L_*; it follows that L_* ≤ L*. This inequality
would, of course, have obtained even if mixed acts were not permitted.
It is a remarkable mathematical fact (not to be proved in this book)
that, permitting mixed acts, equality always obtains; so the special
symbol L_* is superfluous here.

The argument for the recommended choices rests on the equality of
L* and L_*. You realize that I can take at least L* from you and that,
if you are not careful, I may take more. On the other hand, I realize
that you can prevent my taking more than L* from you and that, if
I am not careful, I may get less. This suggests to many that a pair of
intelligent players, each respecting the intelligence of the other, will
each adopt one of the recommended acts.


4 Parallelism and contrast with the minimax theories 

Some formal parallelism between the minimax theories of decision
and the theory of zero-sum two-person games is evident, but the
parallelism is much more complete than may appear at first sight. The
mixtures g are without counterpart in the two minimax theories of
decision, and the appearance of g in (3.1) at the place where i appears in
(9.5.1) may seem to mar the parallelism between these two equations.
But, letting

(1)    $L(f; i) =_{Df} \sum_r L(r; i)\,\phi(r)$,

in the game theory (in close parallelism with the decision theories),

(2)    $L(f; g) = \sum_i L(f; i)\,\gamma(i) \le \max_i L(f; i)$,

and

(3)    $\max_g L(f; g) = \max_i L(f; i)$.

Therefore (3.1) is equivalent to

(4)    $\max_i L(f'; i) = \min_f \max_i L(f; i) = L^*$.
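The step from (2) to (3) can be checked numerically. The following is only an illustrative sketch: the payment table `L`, the mixture `phi`, and the grid of mixtures are hypothetical choices, and exact fractions are used so that the comparison is not blurred by rounding.

```python
from fractions import Fraction

def loss(L, phi, gamma):
    """L(f; g): sum over r and i of L(r; i) * phi(r) * gamma(i)."""
    return sum(L[r][i] * phi[r] * gamma[i]
               for r in range(len(phi)) for i in range(len(gamma)))

def worst_pure(L, phi):
    """max over i of L(f; i), where L(f; i) = sum over r of L(r; i) * phi(r)."""
    return max(sum(L[r][i] * phi[r] for r in range(len(phi)))
               for i in range(len(L[0])))

L = [[2, -1], [0, 3]]                      # hypothetical payments L(r; i)
phi = [Fraction(1, 4), Fraction(3, 4)]     # a mixed act f
# Eleven mixtures g, from "always i = 2" to "always i = 1".
grid = [[Fraction(k, 10), 1 - Fraction(k, 10)] for k in range(11)]
bound = worst_pure(L, phi)
```

No mixture in the grid does better for the maximizing player than the best single value of i, which is what (3) asserts.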


Thus from the point of view of the minimax theories of decision the 
g’s represent no material innovation and are at worst useless baggage. 
Actually, though of little if any relevance in the interpretation of the 
minimax theories, the g’s constitute a useful mathematical device. 
Their usefulness has in fact been illustrated in working out the second
example in § 9.6 and will be systematically demonstrated in the next
chapter, along with the usefulness of the apparently irrelevant
"maximin" problem posed by (3.2) and of the fact that $L_* = L^*$.

Some remarks on the possibility of interpreting the g’s in the minimax 
theories are postponed to the end of this section. 

In the game theory, L may be any function whatsoever of its
arguments r and i, but, in the decision theories, L is subject to the condition
that, for every i,

(5)    $\min_r L(r; i) = 0$,

where L(r; i) is of course to be interpreted as $L(f_r; i)$. Here is the only
mathematical difference between the game theory and the decision
theories, the former being mathematically slightly more general than
the latter.


Though the mathematical differences are negligible, the intellectual
difference between the situations leading to the game theory on the
one hand and to the decision theories on the other is great. Serious
misunderstandings of the (objectivistic) minimax theory have often
resulted from identifying it with the game theory. Among other things,
loss is then confounded with negative income, and the misconception
that the (objectivistic) minimax rule is ultrapessimistic is created. I
have even heard it stated on this account that the minimax rule amounts
to the assumption that nature is malevolently opposed to the interests
of the deciding person.

Though mathematical convenience seems to be the basic reason for
introducing the g's in the minimax theories, it is tempting to ask whether
the g's have also some natural interpretation in those theories. At the
moment, I do not see a convincing interpretation in either theory, but
completeness demands an account of an interpretation suggested by
Wald for his version of the objectivistic theory, especially since this
interpretation influenced some of Wald's most widely used terminology.

The objectivistic problem of deciding on an act in ignorance of which
partition element $B_i$ obtains, the $P(B_i)$ being regarded as meaningless,
suggests a new problem that may perhaps also be called objectivistic.
The new problem arises on postulating that $P(B_i)$ is meaningful but
utterly unknown, that is, $P(B_i) = \gamma(i)$, where the $\gamma(i)$'s are the
components of a g here interpreted as the a priori distribution unknown to
the deciding person.

Since for Wald "loss" was synonymous with "negative expected
income," he naturally calculated the loss of the new problem thus:

(6)    $L(f; g) = -E(f \mid g)$
       $= \sum_i -E(f \mid B_i)\,P(B_i)$
       $= \sum_i L(f; i)\,\gamma(i)$,

arriving thus at the very function suggested by the game theory. In
Wald's version of the theory, the new problem therefore amounts to
the formal introduction of the g's in connection with the old one, which
neatly fulfills the reasonable expectation that there should be no
material difference between regarding $P(B_i)$ as meaningless and regarding
it as meaningful but utterly unknown.


The suggested interpretation of a g as an unknown (or, to mirror
Wald more faithfully, fictitious) a priori distribution does not work,
however, if the loss function of the new problem is defined by (9.4.1),
for the new function $\tilde L(f; g)$ is not then generally the same as the
function L(f; g) suggested by the game theory; thus

(7)    $\tilde L(f; g) = \max_{f'} E(f' - f \mid g)$
       $= \max_{f'} \sum_i E(f' - f \mid B_i)\,\gamma(i)$
       $= \max_{f'} \sum_i \{L(f; i) - L(f'; i)\}\,\gamma(i)$
       $= L(f; g) - \min_{f'} L(f'; g)$
       $\le L(f; g)$,

equality holding for a typical g (i.e., a g such that $\gamma(i) > 0$ for every i)
only in the altogether trivial situation that F is dominated by one of
its elements.

Does this mean that, contrary to expectation, there is a material
difference between the new problem with loss $\tilde L$ and the old one? The
following exercises show that it does not.

Exercises

1. $\max_g \tilde L(f; g) = \max_i L(f; i)$.

2. $\min_f \max_g \tilde L(f; g) = L^*$.

3. $\max_g \tilde L(f; g) = L^*$, if and only if $\max_i L(f; i) = L^*$.


CHAPTER 12 


The Mathematics 
of Minimax Problems 


1 Introduction 


Since the two different minimax decision theories and the theory of
zero-sum two-person games have a common mathematical core, it will
be worth while to digress for a chapter even at the expense of some
repetition, to discuss this common core mathematically, that is,
virtually without reference to its various possible interpretations. The
discussion will have to be drastically confined relative to the large body
of relevant literature, but the reader who wishes to pursue the subject
much further will find [B18], [V4], [W3], and [M3] to be key references.

2 Abstract games


To begin with a ver[...] to the one of main interest [...] L(f; g) be the
value of [...]


It will, however, be assumed for simplicity that
for every f' and g' the quantities

(1)    $\max_g L(f'; g)$,  $\min_f L(f; g')$,
       $L^* =_{Df} \min_f \max_g L(f; g)$,  $L_* =_{Df} \max_g \min_f L(f; g)$

exist. To say that a maximum, for example, exists is not only to say
that the function in question is bounded from above, but also that the
maximum value is actually attained for at least one value of the
argument. For want of a more neutral term, call the function L(f; g) an
abstract game.


An f' is called minimax, if and only if

(2)    $\max_g L(f'; g) = L^*$,

and a g' is called maximin, if and only if

(3)    $\min_f L(f; g') = L_*$.

The existence of minimax and maximin values of the variables is
implicit in (1). It is an easy exercise to show that f' is minimax, if and
only if

(4)    $L(f'; g) \le L^*$

for every g.

The corresponding characterization of maximin g's as those such
that

(5)    $L(f; g') \ge L_*$

for every f could similarly be shown. But the symmetry of the
situation is such that it would be superfluous to derive this characterization
of a maximin explicitly. Indeed, every theorem, or general conclusion,
about L(f; g) obviously has a dual, which arises on applying the
theorem to the new abstract game L(g; f) with L(g; f) = −L(f; g). This
is typical of what is known in mathematics as a duality principle.
Henceforth the duals of demonstrated conclusions, even when not explicitly
stated, will be as freely used as the demonstrated conclusions
themselves. Some conclusions are of course self dual. Incidentally, another
example of a duality principle was used in § 5.4, and a very important
one was pointed out in connection with Boolean algebra in § 2.4.

An argument showing that $L_* \le L^*$ was given in connection with
the theory of games. More formally, if f' and g' are, respectively,
minimax and maximin, then from (4) and (5)

(6)    $L^* \ge L(f'; g') \ge L_*$.

It is possible, indeed typical, that $L_* < L^*$. Suppose, for example,
that f and g are variables that take only two values and that L(f; g)
is described by Table 1. Here, as the reader should verify, both f's
and both g's are minimax and maximin, respectively, and $L^* = 1$,
$L_* = 0$.

TABLE 1. L(f; g)

              g
            1   2
    f   1   1   0
        2   0   1
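The verification asked of the reader for Table 1 can be sketched mechanically. In this sketch the rows of the nested list stand for the values of f and the columns for the values of g; the function names are illustrative only.

```python
def minimax_value(L):
    """L^*: the minimum over f (rows) of the maximum over g (columns)."""
    return min(max(row) for row in L)

def maximin_value(L):
    """L_*: the maximum over g (columns) of the minimum over f (rows)."""
    return max(min(column) for column in zip(*L))

table_1 = [[1, 0],
           [0, 1]]
upper = minimax_value(table_1)   # L^*
lower = maximin_value(table_1)   # L_*
```

Every row has maximum 1 and every column has minimum 0, so each f is minimax and each g is maximin, while the two values differ.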


The following theorem is frequently applicable in the identification
of minimax and maximin values of f and g, and of $L^*$ and $L_*$.




THEOREM 1 If f', g', and the number C are such that
$L(f'; g) \le C \le L(f; g')$ for every f and g; then
$L^* = L_* = C = L(f'; g')$, f' is minimax, and g' is maximin.

PROOF. First, $C \ge L^*$, because

(7)    $C \ge \max_g L(f'; g) \ge \min_f \max_g L(f; g) = L^*$;

and, dually, $C \le L_*$. But $L_* \le L^*$; so $C \le L_* \le L^* \le C$, that is,
$L^* = L_* = C$. Now (4) and (5) apply. ∎

COROLLARY 1 If f' and g' are such that $L(f'; g) \le L(f; g')$ for every
f and g; then f' and g' are, respectively, minimax and maximin, and
$L^* = L_* = L(f'; g')$.


3 Bilinear games 


If one stumbles somehow onto a pair f', g' satisfying the hypothesis
of Corollary 2.1, then he has discovered a minimax, a maximin, and
the values (in this case equal to each other) of $L^*$ and $L_*$. But that
possibility of discovery does not exist unless $L^* = L_*$, which at the
level of generality of the last section is unusual. Almost all real
interest, however, centers on a very special class of abstract games,
here to be called bilinear games, for which it is demonstrable that
$L^*$ is invariably equal to $L_*$.

The definition of bilinear games involves several steps. First,
consider an abstract game, L(r; i), based on a pair of variables, r and i.
The two variables are here assumed for simplicity to have only a finite
number of possible values, an assumption that can, and for statistics
must, be considerably relaxed. Next, let f and g be non-negative
functions of r and i, respectively, arbitrary except for the constraint that

(1)    $\sum_r f(r) = \sum_i g(i) = 1$,

in short, probability measures on the r's and i's, respectively. Finally,
the bilinear game L(f; g) is defined thus:

(2)    $L(f; g) =_{Df} \sum_{r,i} L(r; i)\,f(r)\,g(i)$.


It is important to recognize that the duality principle continues to
hold, that is, if L(f; g) is a bilinear game, then L(g; f) = −L(f; g) is
also one.

In terms of the auxiliary functions

(3)    $L(f; i) =_{Df} \sum_r L(r; i)\,f(r)$,
       $L(r; g) =_{Df} \sum_i L(r; i)\,g(i)$,

the following equalities and inequalities can easily be verified by the
reader.

(4)    $\max_g L(f; g) = \max_i L(f; i)$,
       $\min_f L(f; g) = \min_r L(r; g)$.

(5)    $\min_r \max_i L(r; i) \ge \min_f \max_i L(f; i) = L^* \ge L_*$
       $= \max_g \min_r L(r; g) \ge \max_i \min_r L(r; i)$.
But more can be said in connection with (5), for it has been shown by
von Neumann [V3] that for the special class of functions now under
discussion $L^*$ is actually equal to $L_*$. This important equality cannot
conveniently be proved here, but the interested reader can refer to the
relatively simple proof given by von Neumann and Morgenstern in
Section 17.6 of [V4] (reading first, if necessary, the introduction to the
mathematics of convex sets that constitutes Chapter 16 of that book)
or to the version of it presented in [B18].

In the light of the equality of $L^*$ and $L_*$, (5) becomes

(6)    $\min_r \max_i L(r; i) \ge \min_f \max_i L(f; i) = L^*$
       $= \max_g \min_r L(r; g) \ge \max_i \min_r L(r; i)$.


In view of (4) and (6), Theorem 2.1 can be much improved upon for
bilinear games:

THEOREM 1 For bilinear games, the following three conditions on
f', g', and C are equivalent:
1. f' is minimax, g' is maximin, and $L^* = C$.
2. $L(f'; g) \le C \le L(f; g')$ for every f and g.
3. $L(f'; i) \le C \le L(r; g')$ for every i and r.

PROOF. Condition 2 implies 1, by Theorem 2.1; 1 implies 3 by (6);
and 3 implies 2 by (4). ∎

COROLLARY 1 A necessary and sufficient condition that f be
minimax is that, for some g, $L(f; i) \le L(r; g)$ for every r and i. Under
that condition $L^* = L(f; g)$, and g is maximin.
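The condition of Corollary 1 involves only finitely many comparisons, so a guessed pair f, g can be checked directly. The sketch below is one possible rendering; the function names and the two-by-two table in the usage note are hypothetical, and exact fractions are assumed for the weights.

```python
from fractions import Fraction

def loss_f_i(L, f, i):
    """L(f; i) = sum over r of L(r; i) * f(r)."""
    return sum(L[r][i] * f[r] for r in range(len(f)))

def loss_r_g(L, r, g):
    """L(r; g) = sum over i of L(r; i) * g(i)."""
    return sum(L[r][i] * g[i] for i in range(len(g)))

def certify(L, f, g):
    """Return L^* if L(f; i) <= L(r; g) for every r and i; otherwise None."""
    upper = max(loss_f_i(L, f, i) for i in range(len(g)))
    lower = min(loss_r_g(L, r, g) for r in range(len(f)))
    if upper <= lower:
        # L(f; g) lies between the two bounds, so all three coincide.
        return upper
    return None
```

For the table of Table 2.1 regarded as L(r; i), the call `certify([[1, 0], [0, 1]], [Fraction(1, 2)] * 2, [Fraction(1, 2)] * 2)` returns 1/2, whereas replacing f by the unmixed `[1, 0]` returns `None`.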


Corollary 1 seems an especially appropriate expression of Theorem 1
in connection with the minimax decision theories, where the g's are, after
all, not really of interest in themselves. Theorem 1, and equivalently
Corollary 1, are of great practical value. To be sure, there are
algorithms, or rules (given by Shapley and Snow in [S12]), by which $L^*$
and all minimax values of f can in principle be computed, but these
algorithms are so awkward to apply that in practice one generally guesses
one or more minimax f's, and also a maximin g, on the basis of some
clues, verifying the guess and evaluating $L^*$ by Corollary 1. To finish
the job, one then finds, if one can, an argument to show that the
minimax f's thus discovered are all there are. This rather imperfect
procedure is especially important, since it can relatively easily be extended
to many situations in which r and i are not confined to finite ranges, as
does not seem to be true of the algorithms.

As was mentioned in § 10.3 and as the examples that have been given
illustrate, if f is minimax, then L(f; i) is in practice often actually equal
to $L^*$ for all, or at least many, values of i. Insight into that
phenomenon is given by the following theorem.

THEOREM 2 If i is such that there exists a maximin g for which
g(i) > 0, then $L(f; i) = L^*$ for every minimax f.

PROOF. $L(f; i) \le L^*$, because f is minimax. Therefore L(f; g),
being a weighted average of the L(f; i)'s, is at most $L^*$; and it is actually
less, if any term with positive weight is not equal to $L^*$. But
$L(f; g) = L^*$, because g is maximin. ∎


It can happen, and in statistical practice it often does happen, that
every i satisfies the hypothesis of Theorem 2, in which case
$L(f; i) = L^*$ for every i and every minimax f.

Theorem 2 often provides a basis for guessing a minimax f, a maximin
g, and the value of $L^*$, which can then be checked by application of
Corollary 1. To take a simple example, suppose that there are n values
of r, and n of i. There may be some reason to conjecture that each i
is used by some maximin g, that is, that each i satisfies the hypothesis
of Theorem 2. If the conjecture is in fact true, then f(r) and $L^*$ satisfy
the system of equations

(7)    $\sum_r f(r) + 0 \cdot L^* = 1$,
       $\sum_r L(r; i)\,f(r) - 1 \cdot L^* = 0$.

Typically, (7) as a system of n + 1 linear equations in n + 1 variables
will have exactly one solution (f(r), $L^*$). This solution, if the
conjecture is valid, will actually consist of the components of a minimax f
(in this case the only one) and the value of $L^*$. But the conjecture is
not yet confirmed. In particular, if any f(r) in the solution of (7) is
negative, it is contradicted; if not, the investigation can proceed. The
candidates for maximin values of g are now, by the dual of Theorem 2,
among the solutions of the system

(8)    $\sum_i g(i) + 0 \cdot L^* = 1$,
       $\sum_i L(r; i)\,g(i) - 1 \cdot L^* = 0$,

where r is confined to the values for which f(r) > 0. To consider only
the simplest and most typical case, suppose f(r) > 0 for every r.
Regarding $L^*$ as known, (8) consists of n + 1 equations for n variables,
which at first sight might be expected generally to have no solution.
To put the matter differently, if one forgets for the moment that $L^*$
has been determined by (7), it might seem possible that (8) could lead
to a different value, say $L^{*\prime}$. But, using the latter part of (8) and then
the first part of (7), it is seen that

(9)    $\sum_{r,i} L(r; i)\,f(r)\,g(i) = \sum_r f(r)\,L^{*\prime} = L^{*\prime}$,

and dually the double sum equals $L^*$; so discrepancy between $L^*$ and
$L^{*\prime}$ is not among the real snags in the tentative program, irrespective
of the number of r's participating in (8). Finally, if (8) leads to even
one set of positive g(i)'s, it follows from Corollary 1 that the f and $L^*$
derived from (7) are the unique minimax and the true value of $L^*$,
respectively.
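The first step of the tentative program, solving the system (7), can be sketched as follows. This is an illustration only: the elimination routine is a generic exact solver written for the purpose, it assumes the system is nonsingular (the "typical" case of the text), and the table used in the usage note is hypothetical.

```python
from fractions import Fraction

def solve_linear(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A square, nonsingular)."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        # Fails with StopIteration if no pivot exists, i.e. if A is singular.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def equalizing_f(L):
    """Solve (7): sum_r f(r) = 1 and sum_r L(r; i) f(r) - L^* = 0 for every i."""
    n = len(L)
    A = [[1] * n + [0]]
    A += [[L[r][i] for r in range(n)] + [-1] for i in range(n)]
    b = [1] + [0] * n
    *f, value = solve_linear(A, b)
    return f, value
```

With the hypothetical table `[[0, 3], [1, 0]]`, `equalizing_f` gives f = (1/4, 3/4) and the candidate value 3/4. As the text stresses, a negative component of f would contradict the conjecture, and the candidate still has to be confirmed through (8) and Corollary 1.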

The converse of Theorem 2 has been proved by Bohnenblust, Karlin,
and Shapley in [B19], though their proof cannot be reproduced here.
As is pointed out by these authors, the converse does not extend at all
readily to situations involving infinite ranges of r and i. Theorem 2
and its converse can be summarized thus:

THEOREM 3 There exists a maximin g for which g(i) > 0, if and
only if $L(f; i) = L^*$ for every minimax f.


4 An example of a bilinear game 


It is now convenient to discuss a certain example, or rather a class of
examples, of bilinear games, namely those in which i takes only two
values, say 1 and 2. Two preliminary remarks will help to orient the
discussion. First, bilinear games in which i takes only one value are
devoid of interest, for the minimax problem in that case is simply a
problem of finding an ordinary minimum. Second, the discussion of
bilinear games in which i takes only two values includes, in effect,
because of the duality principle, the discussion of those in which r takes
only two values.

If i takes only the two values 1 and 2, the values g = {g(1), g(2)}
can be represented graphically by points on an interval, as illustrated
at the foot of Figure 1. For every r, L(r; g) is linear as a function of
g, as is L(f; g) for every f. It is, of course, just because the L(f; g) of a
bilinear game is linear in this sense and its dual that the game is called
"bilinear." In Figure 1 the five slanting [...] functions L(r; g) of a
bilinear [...] each of which has for simplicity been so chosen as to use,
or mix, only two values of r.

Figure 1

As may be verified by inspection, the particular bilinear game
represented by Figure 1 has the special property that $\min_r L(r; i) = 0$ for
each i, which is the distinguishing property of those bilinear games that
arise in connection with the minimax decision theories described in
Chapters 9 and 10.

Figure 1 bears a more than accidental resemblance to Figure 7.2.1. 


In particular, the concave function

(1)    $\min_r L(r; g)$

marked by heavy line segments in Figure 1 is closely analogous to the
convex function so marked in Figure 7.2.1. The particular g
emphasized by Figure 1 is that for which the function (1) attains its maximum
value, which according to (3.6) is $L^*$. This g is therefore the unique
maximin. It has been shown quite generally in [B19] that bilinear games
with more than one minimax or maximin are, in a sense, unusual;
Figure 1 makes it graphically clear that the special bilinear games now
under consideration do usually have a unique maximin, because there
is more than one maximin only in case (1) happens to have a horizontal
segment.

What are the minimax f's for the bilinear game represented by Figure
1? According to the dual of Theorem 3.2, an r cannot be used in the
formation of a minimax f unless $L(r; g) = L^*$ for the (in this case
unique) maximin g. That consideration eliminates all but two of the
r's from consideration, and it is graphically clear that this will usually
be the case for bilinear games in which i takes only two values.
Theorem 3.2 itself, applied to the particular game under discussion, implies
that the graph of L(f; g) as a function of g must be horizontal for every
minimax f. The two preceding conditions together eliminate all values
of f except the one corresponding to the horizontal dashed line in
Figure 1; and that f is indeed minimax, because $L(f; i) = L^*$ for both
values of i.

To specialize still further, suppose that r as well as i takes only two
values. Such a game can, of course, be represented graphically in the
spirit of Figure 1. Several qualitatively different possibilities arise,
which might, for example, be classified by the relation of the two linear
functions L(r; g) to each other. The reader should graph and investigate
many or all of these possibilities for himself. The only one treated
here will be that in which the two functions cross each other at an
interior g, with one function sloping up and the other down. It is
graphically clear that there will then be a unique minimax and a unique
maximin, as will now be shown analytically.


The condition postulated can be expressed without loss of generality
thus:

(2)    $L(1; 2) > L(1; 1)$,  $L(2; 1) > L(2; 2)$,
       $L(2; 1) > L(1; 1)$,  $L(1; 2) > L(2; 2)$.

Or, more mnemonically,

(3)    $L(1; 2),\ L(2; 1) > L(1; 1),\ L(2; 2)$.

It is conjectured, in this case on graphical grounds, that the program
outlined in connection with (3.7-8) applies, and the reader can indeed
verify that that program leads to the conclusion

(4)    $L^* = \{L(1; 2)L(2; 1) - L(1; 1)L(2; 2)\}/\Delta$,

where

(5)    $\Delta = L(1; 2) + L(2; 1) - L(1; 1) - L(2; 2)$;

and that the unique minimax f and maximin g are

(6)    $f(1) = [L(2; 1) - L(2; 2)]/\Delta$,
       $f(2) = [L(1; 2) - L(1; 1)]/\Delta$,

(7)    $g(1) = [L(1; 2) - L(2; 2)]/\Delta$,
       $g(2) = [L(2; 1) - L(1; 1)]/\Delta$.
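Formulas (4) through (7) translate directly into a short routine. The sketch below assumes integer or fractional entries, stores L(r; i) at `L[r-1][i-1]`, and refuses tables that violate (3), since the formulas are derived only under that condition; the function name and the sample table are illustrative.

```python
from fractions import Fraction

def solve_two_by_two(L):
    """Apply (4)-(7); L[r-1][i-1] holds L(r; i) and must satisfy (3)."""
    a, b = L[0]          # L(1; 1), L(1; 2)
    c, d = L[1]          # L(2; 1), L(2; 2)
    if not min(b, c) > max(a, d):
        raise ValueError("condition (3) fails; formulas (4)-(7) do not apply")
    delta = Fraction(b + c - a - d)                 # (5)
    value = (b * c - a * d) / delta                 # (4)
    f = [(c - d) / delta, (b - a) / delta]          # (6)
    g = [(b - d) / delta, (c - a) / delta]          # (7)
    return value, f, g

value, f, g = solve_two_by_two([[1, 4], [3, 2]])
```

For this sample table the routine gives the value 5/2 with f = (1/4, 3/4) and g = (1/2, 1/2); one can confirm by hand that L(f; i) and L(r; g) then equal 5/2 for both values of i and of r.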


If the game arises from an application of the minimax decision theory,
(3) almost always applies. More precisely, in this case, except possibly
for the order of numbering,

(8)    $L(1; 1) = L(2; 2) = 0$ and $L(1; 2),\ L(2; 1) \ge 0$;

so, if only the inequalities in (8) are both strict, (3) applies. Then
(4-7) specialize to

(9)    $L^* = L(1; 2)L(2; 1)/\Delta$,

where

(10)   $\Delta = L(1; 2) + L(2; 1)$;

(11)   $f(1) = L(2; 1)/\Delta$,  $f(2) = L(1; 2)/\Delta$,

(12)   $g(1) = L(1; 2)/\Delta$,  $g(2) = L(2; 1)/\Delta$.

5 Bilinear games exhibiting symmetry

Mathematically the solution of a bilinear game is often simplified by 
considerations of symmetry. For statistical applications, the implica- 
tions of symmetry for bilinear games are of fundamental importance 
in so far as they represent a counterpart in the minimax theory of the 
disreputable but irrepressible principle of insufficient reason. This sec- 
tion discusses these implications in an elementary, but formal, way. 
It can be skimmed over or skipped outright without much detriment 
to the understanding of later sections. 

Any discussion of symmetry involves, at least implicitly, the branch 
of mathematics known as the theory of groups. Though what is to 
be said here about games exhibiting symmetry is intended to be clear 
without prior knowledge of the theory of groups, it may be mentioned 
that introductions to that subject are to be found in many places, for 
example in [B14]. 

It can, and in practice often does, happen that a bilinear game has
some symmetry.† This means that there are permutations, here
symbolized by T, T', etc., of the values of r among themselves and the values
of i among themselves such that

(1)    $L(Tr; Ti) = L(r; i)$

for every r and i, where, of course, Tr and Ti are the values into which
T carries r and i respectively. Permutations satisfying (1) are said to
leave the game invariant, or to belong to the group (of symmetries) of the
game. The permutation U that leaves every r and every i fixed must
be counted among the permutations in the group of the game, but the
game has no symmetry (worthy of the name) unless there are other
permutations besides U in its group.

An example of a game with high symmetry is the game implicit in
the second example of § 9.6, for, to any permutation whatsoever of the
six i's in that game among themselves, there is a corresponding
permutation of the r's such that the two permutations taken together leave
the game invariant. It was, of course, the exploitation of symmetry
that made the treatment of that example relatively simple.
Returning to bilinear games in general, if T and T' are in the group
of the game, then the product TT' defined by the condition that

(2)    $(TT')r =_{Df} T(T'r)$,  $(TT')i =_{Df} T(T'i)$

is obviously also a permutation in the group of the game. This
multiplication of permutations somewhat resembles the ordinary
multiplication of numbers. In particular, (TT')T'' is evidently the same as
T(T'T''), though it is not necessarily true that TT' = T'T.

† This concept is not to be confused with that of "symmetrical games,"
which are those for which the equation L(r; i) = −L(i; r) is meaningful
and true for every r and i.

Relative to this multiplication the permutation U plays the role of
the unit, or number 1, in arithmetic, for it is obvious that TU = UT
= T for any permutation T.

For every permutation T, there is evidently a permutation $T^{-1}$, and
one only, that undoes T, that is, one such that $T^{-1}T = U$. It is easy
to see also that $TT^{-1} = U$ and that, if T is in the group of the game,
$T^{-1}$ is too. The notation $T^{-1}$ is of course motivated by the
consideration that, relative to the multiplication of permutations, $T^{-1}$ plays the
role of the reciprocal of T.

It will be adopted as a definition that Tf and Tg are the functions
such that $Tf(r) = f(T^{-1}r)$ and $Tg(i) = g(T^{-1}i)$ for every permutation
T and for every r and i. The intervention of $T^{-1}$ in this definition
may at first seem arbitrary, but it is motivated by the following
considerations. First, if f is, for example, the function such that $f(r_0) = 1$
and f(r) = 0 for $r \ne r_0$, then Tf should be such that $Tf(Tr_0) = 1$ and
Tf(r) = 0 for $r \ne Tr_0$. Second, S(Tf) should be (ST)f rather than
(TS)f. The definition having been adopted, L(Tf; Tg) can be
calculated thus:

(3)    $L(Tf; Tg) = \sum_{r,i} L(r; i)\,f(T^{-1}r)\,g(T^{-1}i)$
       $= \sum_{r,i} L(Tr; Ti)\,f(T^{-1}Tr)\,g(T^{-1}Ti)$
       $= \sum_{r,i} L(Tr; Ti)\,f(r)\,g(i)$,

where the basic fact is exploited that, if r, i runs once through all pairs
of values, then Tr, Ti also does so. It follows from (1) and (3) that, if
T is in the group of the game, then

(4)    $L(Tf; Tg) = L(f; g)$.

An f (g) is called invariant under the group of the game, if and only if
Tf = f (Tg = g) for every T in the group. There is a natural way to
construct from any f an $\bar f$ invariant under the group, and dually for g.
Namely, let

(5)    $\bar f =_{Df} \frac{1}{n} \sum_T Tf$,

where (here and throughout this section) n is the number of elements
in the group and the summation is over all elements of the group. The
definition (5) accomplishes its objective, because

(6)    $\sum_r \bar f(r) = \frac{1}{n} \sum_T \sum_r f(T^{-1}r) = \frac{1}{n} \sum_T 1 = \frac{n}{n} = 1$,

and

(7)    $T'\bar f(r) = \bar f(T'^{-1}r)$
       $= \frac{1}{n} \sum_T f(T^{-1}T'^{-1}r)$
       $= \frac{1}{n} \sum_T f((T'T)^{-1}r) = \bar f(r)$

for every r and for every T' in the group. In (7) use is made of the
easily established facts that $T^{-1}T'^{-1} = (T'T)^{-1}$ and that as T runs
once through the group so does T'T. The justification of $\bar g$ is, of course,
dual to that of $\bar f$. It is noteworthy that $\bar f = f$, if and only if f is invariant
under the group of the game.
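The definition of Tf and the averaging (5) can be illustrated in a few lines. In this sketch a permutation is a tuple whose entry at position r is Tr, only the action on the r's is shown (the action on the i's is dual), and the cyclic group of order 3 is a hypothetical example, not one taken from the text.

```python
from fractions import Fraction

def act(perm, f):
    """Return Tf, where Tf(r) = f(T^{-1} r); perm[r] is the image Tr."""
    Tf = [None] * len(f)
    for r, Tr in enumerate(perm):
        Tf[Tr] = f[r]            # equivalently, Tf(Tr) = f(r)
    return Tf

def average(group, f):
    """Return f-bar = (1/n) * (sum of Tf over all T in the group), as in (5)."""
    n = len(group)
    images = [act(perm, f) for perm in group]
    return [sum(Fraction(img[r]) for img in images) / n for r in range(len(f))]

# Hypothetical group: the three cyclic shifts of three values of r.
group = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]
f = [Fraction(1, 2), Fraction(1, 2), Fraction(0)]
f_bar = average(group, f)
```

Here `f_bar` assigns weight 1/3 to each r, its weights sum to 1 as in (6), and applying any element of the group to it leaves it unchanged as in (7).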


Suppose R (I) is a set of the r's (i's). Then, by definition, $r \in TR$
($i \in TI$), if and only if $T^{-1}r \in R$ ($T^{-1}i \in I$); and the set R (I) is
invariant under the group of the game, if and only if TR = R (TI = I)
for every T in the group.


Exercises

1a. If R is invariant, so is ~R.
1b. If R and R' are invariant, so are $R \cup R'$ and $R \cap R'$.
1c. The vacuous set and the set of all r's are invariant.
2. For every R, let $\bar R =_{Df} \bigcup_T TR$, where T is of course confined to
the group; and, for every r, define the trajectory of r as $\overline{[r]}$, where [r] is,
as is customary, the set whose only element is r.
(a) $\bar R$ is the smallest invariant set containing R.
(b) $\bar R$ is the intersection of all invariant sets containing R.
(c) $\bar R = \bigcup_{r \in R} \overline{[r]}$.
(d) $\overline{[r]}$ is the smallest invariant set of which r is an element.
3a. If R is invariant, and $R \cap \overline{[r]} \ne 0$, then $R \supset \overline{[r]}$.
3b. If R is invariant, and $r \in R$, then $R \supset \overline{[r]}$.
3c. If $\overline{[r]} \cap \overline{[r']} \ne 0$, then $\overline{[r]} = \overline{[r']}$.
4a. The following conditions are equivalent:
α. R is invariant.
β. $\bar R = R$.
γ. For every $r \in R$, $\overline{[r]} \subset R$.
δ. R is partitioned into sets each of which is a trajectory.
4b. The following conditions are equivalent:
α. f is invariant.
β. The set of r's for which f takes any given value is invariant.
γ. f is constant on every trajectory.
5a. If T'r = r, then $(TT'T^{-1})Tr = Tr$.
5b. If {r} denotes the number of elements of the group that leave r
fixed, then {r} = {Tr}.
5c. If || r || denotes the number of elements in $\overline{[r]}$, then n = {r} || r ||.
5d. Both {r} and || r || are divisors of n.
5e. The value of $\bar f$ everywhere on the trajectory of r is

(8)    $\frac{1}{\| r \|} \sum_{r' \in \overline{[r]}} f(r')$.

6. Note the dual of each of the preceding exercises.
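The counting relation of Exercise 5c can be checked on small cases. The sketch below again writes a permutation as a tuple of images; the group of order 2 is a hypothetical example, and the list passed in is assumed to contain every element of the group.

```python
def trajectory(group, r):
    """The set of all values Tr as T runs through the group (perm[r] is Tr)."""
    return {perm[r] for perm in group}

def fixing_count(group, r):
    """{r}: the number of elements of the group that leave r fixed."""
    return sum(1 for perm in group if perm[r] == r)

# Hypothetical group of order 2 on three values of r: the identity, and the
# permutation that exchanges 0 and 1 while leaving 2 fixed.
group = [(0, 1, 2), (1, 0, 2)]
n = len(group)
```

The values 0 and 1 share a trajectory of two elements and are each fixed only by the identity, while 2 forms a trajectory by itself and is fixed by both permutations; in every case the product of the two counts is n.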


In the establishment of all these preliminaries, the theory of bilinear 
games has been almost lost sight of, but it is now possible to say much 
about the significance of invariant functions and sets for bilinear games. 


I begin with a theorem valued for some of its corollaries rather than 
for any charm of its own. 


THEOREM 1 If $L(f'; Tg) \le L(f''; Tg)$ for every T, then
$L(\overline{f'}; g) \le L(\overline{f''}; g)$. If in addition $L(f'; g) < L(f''; g)$, then
$L(\overline{f'}; g) < L(\overline{f''}; g)$.

PROOF.

(9)    $L(T^{-1}f'; g) = L(f'; Tg) \le L(f''; Tg)$.

Therefore

(10)   $L(\overline{f'}; g) = \frac{1}{n} \sum_T L(T^{-1}f'; g)$
       $\le \frac{1}{n} \sum_T L(f''; Tg) = L(\overline{f''}; g)$.

If $L(f'; g) < L(f''; g)$, then (9) is strict for T = U, and therefore (10)
is also strict. ∎

COROLLARY 1 If $L(f'; Tg) = L(f''; Tg)$ for every T, then
$L(\overline{f'}; g) = L(\overline{f''}; g)$.

COROLLARY 2 If $L(f'; g) = L(f''; g)$ for every g, then
$L(\overline{f'}; g) = L(\overline{f''}; g)$ for every g.

COROLLARY 3 $L(\bar f; g) = L(f; \bar g) = L(\bar f; \bar g)$ for every f and g.

COROLLARY 4 If f is invariant under the group of the game,
$L(f; g) = L(f; \bar g)$ for every g.

Paraphrasing some of the nomenclature of § 6.4, if $L(f'; g) \le L(f''; g)$
for every g, say that f' dominates f''; if f' dominates f'', but f'' does not
dominate f', say that f' strictly dominates f''; if f' dominates f'', and f''
dominates f', say that f' and f'' are equivalent; if f' is not strictly
dominated by any f, say that f' is admissible.

COROLLARY 5 If f' dominates, strictly dominates, or is equivalent
to f'', then $\overline{f'}$ dominates, strictly dominates, or is equivalent to $\overline{f''}$,
respectively.

COROLLARY 6 If $L(f; Tg) \le L(\bar f; Tg)$ for every T, then
$L(f; g) = L(\bar f; g)$.

COROLLARY 7 If $L(f; i) \le L(\bar f; i)$ for every $i \in I$, where I is
invariant under the group of the game, then $L(f; i) = L(\bar f; i)$ for $i \in I$.

COROLLARY 8 It is impossible that f strictly dominates $\bar f$.

THEOREM 2 $\max_g L(\bar f; g) \le \max_g L(f; g)$, equality holding, if and only
if the right-hand maximum is attained for a g invariant under the group
of the game.

PROOF.

(11)   $\max_g L(\bar f; g) = \max_g L(f; \bar g)$
       $\le \max_g L(f; g)$.

The inequality in (11) follows from the fact that every $\bar g$ is a g; equality
holds, if and only if the final maximum is attained for some $\bar g$, that is,
for some invariant g. ∎

COROLLARY 9 If f is minimax, so is $\bar f$.

COROLLARY 10 There exists a minimax f invariant under the group
of the game.

If a game has more than one minimax f, it is tempting to suppose
that in statistical, if not in all, applications of the theory an invariant,
or symmetrical, minimax f would recommend itself at least as highly
as any other minimax f. This supposition, being vague, cannot be
really proved, but certain facts tend to support it. In particular, the
following theorem is a reassuring improvement of Corollary 10.


THEOREM 3 There is at least one admissible, invariant, minimax f. 


PROOF. It is a direct consequence of a theorem (Theorem 2.22, p. 54,
of [W3]) of Wald's, too technical for statement or proof here, that at
least one invariant minimax f is strictly dominated by no invariant f'.
If that f were strictly dominated by any f'' (invariant or not), it would
also, according to Corollary 5, be strictly dominated by $\overline{f''}$, which is
impossible. Therefore f is admissible. ∎

If the bilinear game has high symmetry or, more explicitly, if the
number of trajectories into which the r’s or the i’s, or both, are
partitioned is small, the search for invariant minimax f’s and invariant
maximin g’s is relatively simple. An invariant minimax is characterized
as an invariant f’ such that

(12)  $\max_g L(f'; g) = \min_f \max_g L(f; g) = L^*.$

But, since at least one invariant minimax exists, the criterion (12) is
not changed if the minimization on its right side is confined to invariant
f’s; with f so confined, the criterion is not changed if the
maximizations are confined to invariant g’s. Thus the search for
invariant minimax f’s amounts to the solution of an abstract game derived
from the original bilinear game by ruling out certain acts, namely the
un-invariant ones.


This new and smaller abstract game can be exhibited as a bilinear
game thus: Let it be understood for the moment that r’ ranges over
such a set of the r’s that there is exactly one r’ in every trajectory [r];
dually for i’. For invariant f and g,

(13)  $L(f; g) = \sum_r \sum_i L(r; i) f(r) g(i)$
      $= \sum_{r'} \sum_{i'} \sum_{r \in [r']} \sum_{i \in [i']} L(r; i) f(r) g(i)$
      $= \sum_{r'} \sum_{i'} f(r') g(i') \sum_{r \in [r']} \sum_{i \in [i']} L(r; i)$
      $= \sum_{r'} \sum_{i'} L'(r'; i') f'(r') g'(i'),$

where

(14)  $L'(r'; i') =_{Df} \frac{1}{\|r'\| \, \|i'\|} \sum_{r \in [r']} \sum_{i \in [i']} L(r; i)$

and

(15)  $f'(r') =_{Df} \|r'\| f(r'), \qquad g'(i') =_{Df} \|i'\| g(i').$


Finally, it is easily verified that, except for the conditions f’(r’) ≥ 0,
g’(i’) ≥ 0, and Σf’(r’) = Σg’(i’) = 1, the coefficients f’(r’) and g’(i’) are
arbitrary. The new game is therefore to all intents and purposes a
bilinear game with only as many r’’s and i’’s as there are r-trajectories
and i-trajectories, respectively, in the original game. The new game,
incidentally, may well have symmetry of its own.
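
The passage from the original game to the smaller one can be checked numerically. The following is only a minimal sketch: the 3 × 3 loss table, its two trajectories of r's and of i's, and the function names `reduce_game` and `bilinear` are all hypothetical choices made for illustration. It averages the losses over each pair of trajectories as in (14), rescales invariant weights as in (15), and confirms that the bilinear form has the same value in both games.

```python
from fractions import Fraction

def reduce_game(L, r_orbits, i_orbits):
    """Block-average L over each pair of trajectories, as in (14)."""
    return [[sum(Fraction(L[r][i]) for r in R for i in I) / (len(R) * len(I))
             for I in i_orbits]
            for R in r_orbits]

def bilinear(L, f, g):
    """L(f; g) = sum over r and i of L(r; i) f(r) g(i)."""
    return sum(L[r][i] * f[r] * g[i]
               for r in range(len(f)) for i in range(len(g)))

# Hypothetical game, unchanged when rows 0, 1 and columns 0, 1 are swapped together.
L = [[0, 1, 2],
     [1, 0, 2],
     [3, 3, 1]]
r_orbits = [[0, 1], [2]]
i_orbits = [[0, 1], [2]]

# Invariant mixed acts: constant on each trajectory.
f = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]
g = [Fraction(1, 5), Fraction(1, 5), Fraction(3, 5)]

L_reduced = reduce_game(L, r_orbits, i_orbits)
# (15): weight of a trajectory = (size of the trajectory) * (weight of one member).
f_reduced = [len(R) * f[R[0]] for R in r_orbits]
g_reduced = [len(I) * g[I[0]] for I in i_orbits]

original_value = bilinear(L, f, g)
reduced_value = bilinear(L_reduced, f_reduced, g_reduced)
print(L_reduced, original_value, reduced_value)
```

Exact fractions are used so that the two values can be compared for equality rather than to within rounding error.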

If there is only one r- or one i-trajectory, the new game is so simple it
scarcely deserves to be called a game. This occurs, for example, in the
second example of § 9.6, where there is only one i-trajectory. In that
situation there is only one invariant g, and it is equal at every i to the
reciprocal of the total number of i’s (which is here the value of ‖i‖
for every i). That g must therefore be an admissible maximin. The
value of L* is therefore given by

(16)  $L^* = \min_r \frac{1}{\|i\|} \sum_i L(r; i).$


The invariant minimax f’s are those and only those invariant f’s such
that f(r) = 0 for every r that fails to minimize the sum in (16). Moreover,
here the minimax f’s (invariant or not) are all equivalent, as can
be argued thus: Any invariant minimax f is such that

(17)  $L(f; g) = L(f; \bar g) = L^*$

for every g. If any minimax f whatsoever failed to satisfy (17), it
would strictly dominate f̄; but according to Corollary 8 that is impossible.
Therefore in the very special situation at hand all minimax f’s
satisfy (17) and are accordingly equivalent.

It is, of course, important to extend consideration of symmetry to
bilinear games with infinite sets of r’s and i’s, and infinite groups of
symmetries, but the task has not yet proved straightforward. Two key
references bearing on it are [L4] and [B17].


CHAPTER 13 


Objections to 
the Minimax Rules 


1 Introduction 


I have already expressed and supported my opinion that neither the
objectivistic nor the personalistic minimax rule can be categorically
defended (§ 9.7 and § 10.3). On the other hand, certain objections have
been leveled against the objectivistic rule (that being the well-known
one) that seem to me to call for reinterpretation, if not outright
refutation.


2 A confusion between loss and negative income 


Some objections valid against the minimax rule based on negative 
income are irrelevant to that based on loss. The notions that the mini- 
max rule is ultrapessimistic and that it can lead to the ignoring of even 
extensive evidence have already been discussed as examples of such ob- 
jections. 

Another example I would put in the same category has been suggested
by Hodges and Lehmann [H5]. In this example a person who has
observed n independent tosses of a coin for which the probability of heads
has an unknown value p is required to predict the outcome of the
(n + 1)th toss. Hodges and Lehmann here interpret prediction in the
following somewhat sophisticated, but reasonable, sense. The person
is, in the light of his observation, required to choose a number p̂
between 0 and 1 and to pay a fine of (1 − p̂)² or p̂² according as the
(n + 1)th toss is in fact heads or tails. Thus the (expected) income
attached to the primary act p̂ and event p is

(1)  $I(\hat p; p) = -p(1 - \hat p)^2 - (1 - p)\hat p^2$
     $= -(\hat p - p)^2 - p(1 - p).$

As Hodges and Lehmann show, the only derived act (mixed or pure)
that yields the minimax of the negative income is to set p̂ = 1/2
irrespective of the observation. But it is, in common sense, absurd thus
to ignore the observation of the first n tosses. In view of this absurdity,
almost everyone would agree that applying the minimax rule directly
to the negative of (1) is a foolish act for the person to employ.

The absurdity of minimizing the maximum of negative income in
this example is of course no valid argument against minimizing the
maximum loss. It is easy to see that the loss corresponding to (1) is

(2)  $L(\hat p; p) = (\hat p - p)^2.$

As Hodges and Lehmann happen to show in the same paper [H5]
(though in a different context), and as will be discussed in some detail
in § 4, the unique minimax derived act does use the observations to
advantage, resulting in a loss of

(3)  $\frac{1}{4(1 + n^{1/2})^2}$

irrespective of p. The absurd act of setting p̂ = 1/2 irrespective of the
observation results in the loss (p − 1/2)², which in any ordinary context
would be inferior to (3), especially for large n.

Incidentally, the minimax derived from (2), though not nearly so
bad as setting p̂ identically equal to 1/2, is itself open to a serious
objection, which will be explained in § 4.
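
The step from the income (1) to the loss (2) of this section, and the comparison with the constant (3), can be verified mechanically. The sketch below is illustrative only; the function names and the grid of test values are assumptions made here. It forms the loss as the best income attainable when p is known minus the income actually obtained, checks that this equals the squared error, and compares the worst loss of the constant guess 1/2 with the constant (3) for a hypothetical n = 100.

```python
from fractions import Fraction
from math import sqrt

def income(p_hat, p):
    """Expected income (1): minus the expected fine."""
    return -p * (1 - p_hat) ** 2 - (1 - p) * p_hat ** 2

def loss(p_hat, p):
    """Best attainable income when p is known, minus the income obtained.

    The income is a concave quadratic in p_hat with its maximum at p_hat = p."""
    return income(p, p) - income(p_hat, p)

def minimax_loss(n):
    """The constant (3)."""
    return 1 / (4 * (1 + sqrt(n)) ** 2)

grid = [Fraction(k, 10) for k in range(11)]
identity_holds = all(loss(q, p) == (q - p) ** 2 for q in grid for p in grid)

half = Fraction(1, 2)
worst_loss_of_half = max(loss(half, p) for p in grid)  # attained at p = 0 and p = 1
print(identity_holds, worst_loss_of_half, minimax_loss(100))
```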


3 Utility and the minimax rule 


Some objections to the objectivistic, and mutatis mutandis to the
group, minimax rule are in effect objections to the concept of utility,
which underlies the minimax rules. Criticisms of the concept of utility
have already been discussed in Chapter 5, particularly in § 5.6, but
certain aspects of the discussion need to be continued here.

It is often said, and I think with justice, that, even granting the
validity of the utility concept in principle, a person can seldom write
down his income function I(r; i) with much accuracy. This idea is
put forward sometimes with one interpretation and sometimes with
another. Of these, only the first is strictly an objection to the utility
concept.

That one is a dilemma raised by the phenomenon of vagueness.
Vagueness may so blur a person’s utility judgments that he cannot
accurately write down his income function. I suppose that no one would
seriously deny this; I would be particularly embarrassed to do so, for
it is almost a recapitulation of the very argument that led me, though
in principle a personalist, to see some sense in the objectivistic decision
problem. On the other horn, if all meaning is denied to utility (or some
extension of that notion) no unification of statistics seems possible.

Three special circumstances are known to me under which escape from
the dilemma is possible. First, there are problems in which some 
straightforward commodity, such as money, lives, man hours, hospital 
bed days, or submarines sighted, is obviously so nearly proportional to 
utility as to be substitutable for it. Second, there are problems in 
which exact or approximate minimax decisions can be calculated on 
the basis of only relatively little, and easily available, information about 
the income function, such as symmetry, monotoneity, or smoothness. 
The possibility of cheap extensive observation, which (when it occurs) 
makes the minimax principle attractive, also tends to make many de- 
cision problems fall into both of the two types in which the difficulty 
of vagueness is alleviated. For example, in a monetary decision prob- 
lem with cheap observation available, it often happens that the weak 
law of large numbers, and the like, can be invoked to justify regarding 
cash income as proportional to utility income. 

Third, there are many important problems, not necessarily lacking
in richness of structure, in which there are exactly two consequences,
typified by overall success or failure in a venture. In such a problem,
as I have heard J. von Neumann stress, the utility can, without loss
of generality, be set equal to 0 on the less desired and equal to 1 on the
more desired of the two consequences.


The second sense in which it may, though not quite properly, be
said to be impossible to write down the income function is typified by
this example. A manufacturer of small short-lived objects, say paper
napkins, is faced with the problem of deciding on a program of sampling
to control the quality of his product. He complains that, though
for this problem his utility is adequately measured by money, he cannot
write down his income function because he does not know how the
public will react to various levels of quality, and that, in particular, the
minimax rule does not tell him at all how much he ought to spend on
the sampling program, though it may say how any given amount can
best be employed. The manufacturer has a real difficulty, though he
[…] at the lack of knowledge that […] not only the state of his
product, but also the state of the public; taking the state of the public
[…] in writing down the income function […] manufacturer to make
observations bearing on the state of the public as well as those bearing
on the state of the product, the minimax rule […] remove him from
the paper-[…] practice the personalistic method […] the unknown state
of the public, while objectivistic methods, particularly the minimax
principle, are
now increasingly often used to deal with the state of the product—a 
sort of dualism having some parallel in almost all serious applications 
of statistics. This is not to deny that relatively objectivistic methods 
of market research can sometimes be used, nor that there are personal- 
istic elements aside from those concerning the state of the public in 
much of even the most advanced quality control practice. 


4 Almost sub-minimax acts 
Another sort of objection to the objectivistic minimax rule is
illustrated by the following example attributed to Herman Rubin and
published by Hodges and Lehmann [H5]. An integer-valued random
variable x subject to the binomial distribution

(1)  $P(x \mid p) = \binom{n}{x} p^x (1 - p)^{n - x}$

is observed by a person who knows n but not p. His decision problem
is to decide on a function p̂ of x subject to the loss function:

(2)  $L(\hat p; p) = E((\hat p(x) - p)^2 \mid p)$
     $= \sum_x (\hat p(x) - p)^2 \binom{n}{x} p^x (1 - p)^{n - x}.$

In other terms, he must estimate p on the basis of an observation of x
and subject to a loss equal to the square of his error. The traditional
estimate of p is defined by p̂₀(x) = x/n. This estimate has many virtues;
it is the maximum-likelihood estimate, the only unbiased estimate,
and (as is shown in [G1]) the only minimax estimate for a somewhat
different problem from that posed by (2). But for (2) the unique
minimax is (as is shown in [H5]) defined by

(3)  $\hat p_1(x) = \hat p_0(x) + \frac{\tfrac12 - \hat p_0(x)}{1 + n^{1/2}}.$

As it is straightforward to verify for every p,

(4)  $L(\hat p_0; p) = \frac{p(1 - p)}{n},$

and

(5)  $L(\hat p_1; p) = \frac{1}{4(1 + n^{1/2})^2},$

which constant is, therefore, L*. The ratio of the first of these functions
to the second is

(6)  $4p(1 - p)(1 + n^{-1/2})^2,$

the maximum of which occurs at p = 1/2 and is

(7)  $(1 + n^{-1/2})^2.$


Thus, for large n, the maximum loss of p̂₀ is larger than L* by only a
slight fraction. Moreover, the loss of p̂₀ is less than L* except when p
lies in the interval where

(8)  $4p(1 - p) > (1 + n^{-1/2})^{-2},$

that is, where

(9)  $|p - \tfrac12| < \tfrac12 \{1 - (1 + n^{-1/2})^{-2}\}^{1/2} \approx (4n)^{-1/4}.$
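
The constant loss of the minimax estimate and the interval described by (8) and (9) can be confirmed by direct summation over the binomial distribution. What follows is a rough sketch under stated assumptions: the sample size n = 16 and the function names are chosen here for convenience (16 makes the square root exact), and ordinary floating-point arithmetic is used.

```python
from math import comb, sqrt

def risk(estimate, n, p):
    """Expected squared error (2), by direct summation over x = 0, ..., n."""
    return sum((estimate(x, n) - p) ** 2 * comb(n, x) * p ** x * (1 - p) ** (n - x)
               for x in range(n + 1))

def p0(x, n):
    """Traditional estimate x/n."""
    return x / n

def p1(x, n):
    """Minimax estimate (3): x/n pulled toward 1/2."""
    return p0(x, n) + (0.5 - p0(x, n)) / (1 + sqrt(n))

def half_width(n):
    """Exact half-width in (9) of the interval on which p1 has the smaller loss."""
    return 0.5 * sqrt(1 - (1 + n ** -0.5) ** -2)

n = 16
for p in (0.1, 0.3, 0.5, 0.8):
    print(p, risk(p0, n, p), risk(p1, n, p))
print(half_width(n))
```

For n = 16 the half-width is 0.3, so the two estimates have equal loss at p = 0.8, which the printed values should bear out up to rounding.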
To take a numerical example, consider n = 10⁵ (which the practical
will note is rather big for a sample). The advantage of p̂₁ over p̂₀ at
p = 1/2 is then only 0.64%, and, once p departs by as much as 0.04
from 1/2 in either direction, the advantage is with p̂₀. It amounts,
for example, to 3.5%, 15.5%, ∞% in favor of p̂₀, when p is 0.6, 0.8,
1.0, respectively.

Many agree that in such an example good judgment will, under
ordinary circumstances, prefer p̂₀ to the recommendation of the minimax
rule, p̂₁. To my mind, this example constitutes a valid objection against
the minimax rule, in the sense that it demonstrates once more that,
whatever value that rule may have, it is at best a rule of thumb.
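
Readers who wish to rework the numerical example can do so from the ratio (6) and the interval (9) alone. The sketch below assumes the sample size n = 10⁵, an assumption made here because it is the value at which the threshold of about 0.04 quoted above is reproduced; the function names are likewise hypothetical. It reports the raw ratio of the two losses, since the percentage one quotes depends on which of the two losses is taken as the base.

```python
from math import sqrt

def loss_ratio(p, n):
    """Ratio (6): loss of the traditional estimate divided by L*."""
    return 4 * p * (1 - p) * (1 + n ** -0.5) ** 2

def half_width(n):
    """Exact half-width in (9); approximately (4n) ** -0.25 for large n."""
    return 0.5 * sqrt(1 - (1 + n ** -0.5) ** -2)

n = 10 ** 5  # assumed sample size
print(loss_ratio(0.5, n))                 # slightly above 1: minimax estimate ahead
print(half_width(n), (4 * n) ** -0.25)    # distance from 1/2 at which the lead changes
for p in (0.6, 0.8, 1.0):
    print(p, loss_ratio(p, n))            # below 1: traditional estimate ahead
```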
The example is a good illustration of the role of personal probability
in ordinary statistical thinking, for the source of the dissatisfaction a
person would ordinarily feel for p̂₁ […] above, for example, that, if the
person attaches a probability of less than […] prefer p̂₀ to p̂₁; the same
conclusion […] that the standard deviation of th[…] p̂₁; the point of
the example is only that there are situations in which that would clearly
not be the case.

Interesting material and important references bearing on the
phenomenon illustrated by the decision problem under discussion are given
by Wolfowitz in [W17]. It seems to be suggested there that the
difficulty can be met by postulating some small amount ε by which the
person does not mind having his income decreased. Taken literally, 
this postulate implies on repeated application that all incomes are 
equivalent for the person, but Wolfowitz makes it clear that he does 
not mean to propose the postulate in a sense that allows repeated ap- 
plications. The idea is reminiscent of those theories of probability 
that permit the neglect of an occasional improbable event (mentioned 
in the last paragraph of § 4.4) and seems to me open to an objection 
similar to the one raised in connection with them. In particular, the
choice of the ε would be not only personal, but ill defined as well.


5 The minimax rule does not generate a simple ordering 

Finally, an objection made by Chernoff [C7] to the objectivistic mini- 
max theory must be discussed. This will entail statement and illus- 
tration of the phenomenon on which the objection is based, and state- 
ment and analysis of the objection itself. 

The phenomenon pertains to the relation between two objectivistic
decision problems, to be called for the moment the narrow and the
wide problems. The narrow problem is determined by certain primary
acts f_r; and the wide one is determined by those primary acts and one
more, say f₀. In other words, the wide problem presents the person
with one more choice than the narrow. Calling the two income functions
I(f; i) and I₀(f; i), it is to be understood, of course, that I(f; i)
= I₀(f; i) for any f that does not use, that is, give positive weight to,
f₀. The corresponding equation does not necessarily obtain for the
loss functions; indeed it clearly does so, if and only if the maximum of
I₀(f; i) in f can be attained for each i without using f₀. Even in case
no minimax of the wide game uses f₀, it is therefore to be expected that
the minimax f’s of the wide game will be different from those of the
narrow game. In fact, it can happen that no minimax of the wide game
uses either f₀ or any f_r used by a minimax of the narrow game; this is
the phenomenon to be discussed in this section.

To see how the phenomenon can occur, suppose that Figure 12.4.1
represents the loss function of the narrow problem; and consider what
the corresponding figure is for the wide problem, supposing that f₀ is
such that

(1)  $\Delta =_{Df} I_0(f_0; 2) - \max_r I(f_r; 2) > 0,$
     $\Sigma =_{Df} \max_r I(f_r; 1) - I_0(f_0; 1) > 0.$

It is clear that Δ and Σ can attain any positive values, irrespective of
the structure of the narrow problem. The figure for the wide problem
is constructed thus: The graph corresponding to each f_r is left fixed at
its right end and raised by the amount Δ at its left, and f₀ is represented
by a line sloping up with slope Σ from the lower left-hand corner. It is
easy to see that the raising of the left ends of the graphs of the f_r’s can
make any f_r with a positive slope horizontal. If, further, such an f_r
minimizes L(f; g) for some g, it can be made a minimax by choosing Σ
sufficiently large. Thus, speaking specifically of Figure 12.4.1, the f_r
corresponding to the left segment of the heavy concave graph, which is
not used in the minimax of the narrow problem, can become the unique
minimax. Figure 12.4.1 is a little special in that the heavy concave
graph has only one vertex to the left of the maximin of the narrow
problem. If there were more than one, the phenomenon could also be
exhibited by making the second vertex to the left the unique maximin,
which would occur for all Δ’s and Σ’s in a certain range. Thus the
phenomenon occurs not only for isolated values of Δ and Σ but typically
for whole domains of values.

Suppose, to take a striking case, that one f_r, say f_r’, is the unique
minimax for the narrow problem and a different one, f_r”, is the unique
minimax for the wide problem. It is absurd, as Chernoff says in effect,
to recommend f_r’ as the best act among the f_r’s when only the f_r’s are
available and then to recommend f_r” […] class of possibilities. […]
“Seeing that you have geese, […] ham.”

[…], to the establish[…] among acts. In so far as that
can be done consistently with the sure-thing principle, personal
probability is practically defined thereby. If the sure-thing principle is
violated, the ordering is absurd as an expression of preference. For
example, the rule of minimizing the maximum of the negative of income
does not exhibit the phenomenon. It amounts to considering f ≤ f’, if
and only if 


(2) max I(f; i) < max I(f’; i). 


This establishes a simple ordering, but one that violates the sure-thing 
principle by violating P2. 

The phenomenon has a particularly natural interpretation for the
group minimax rule. It would not be strange, for example, if a
banquet committee about to agree to buy chicken should, on being
informed that goose is also available, finally compromise on duck.
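
The dependence of the recommended act on the acts available can be exhibited in a small numerical case. The sketch below is a simplification and not the construction of the text: it is confined to pure acts (no mixtures), and the acts `f1`, `f2`, `f0` with their incomes in two states are hypothetical numbers invented here. The added act `f0` is better than every narrow act in the second state and worse in the first, so both gaps defined in (1) of this section are positive (3 and 9 respectively).

```python
def minimax_pure_act(incomes):
    """incomes maps each act to its tuple of incomes, one per state.

    Returns the pure act with the smallest maximum loss, and all maximum losses,
    where loss = best income available in that state minus income obtained."""
    n_states = len(next(iter(incomes.values())))
    best = [max(v[i] for v in incomes.values()) for i in range(n_states)]
    max_loss = {act: max(best[i] - v[i] for i in range(n_states))
                for act, v in incomes.items()}
    return min(max_loss, key=max_loss.get), max_loss

narrow = {"f1": (10, 6), "f2": (5, 10)}   # hypothetical incomes in states 1 and 2
wide = dict(narrow, f0=(1, 13))           # the same acts and one more

narrow_choice, narrow_losses = minimax_pure_act(narrow)
wide_choice, wide_losses = minimax_pure_act(wide)
print(narrow_choice, narrow_losses)
print(wide_choice, wide_losses)
```

Among pure acts the rule picks `f1` in the narrow problem and `f2` in the wide one, although `f0` itself is never chosen; with mixed acts admitted the numbers would differ, but the effect is of the same kind.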


CHAPTER 14


The Minimax Theory
Applied to Observations


1 Introduction


In this chapter the concept of observation is re-explored from the
point of view of the minimax rule. In principle, objectivistic and group
minimax problems should here be treated on an equal footing. But,
since mathematically the two theories are identical, it seems wisest to
focus on one, interjecting occasional digressions about the other. I
have chosen to focus on the objectivistic problems. That choice, being
in accordance with other literature on the minimax rule, will facilitate
the reader’s further study of the subject, and it also renders more
obvious the intimate connection between the minimax rules and the theory
of partition problems presented in Chapter 7. The present chapter
can indeed be regarded largely as a paraphrase of Chapter 7, so there
will unavoidably be many references to the notations and conclusions
of that chapter.


2 Recapitulation of partition problems 


More explicitly, the basic problem may be any objectivistic problem.
It will be characterized by the values of E(f | B_i), where f ranges over
a set of acts F subject to the conditions laid down in § 9.3, and B_i is a
partition.

The observation is a random variable x (confined, as usual in this
book, to a finite set of values), subject to the conditional distributions
P(x | B_i), and so articulated with F that E(f | B_i, x) = E(f | B_i) for
every x such that P(x | B_i) > 0. The last condition is (7.2.7); as
mentioned in connection with that equation, the condition will in
particular be met, if every f is constant on every B_i, a specialization
costing but little in real generality.

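The derived problem (paralleling § 6.2) consists of F(x), the set of all
functions assigning elements f of the basic acts F to values x of the
observation x (in the displayed equations the random variable is written
$\mathbf x$ and its values x). The values of E(f(x) | B_i) for f(x) ∈ F(x) are
computable from the E(f | B_i) and the P(x | B_i) thus:

(1)  $E(f(\mathbf x) \mid B_i) = E(E(f(\mathbf x) \mid B_i, \mathbf x) \mid B_i)$
     $= \sum_x E(f(x) \mid B_i, x) P(x \mid B_i)$
     $= \sum_x E(f(x) \mid B_i) P(x \mid B_i).$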

It will now be shown that the set of derived acts F(x) satisfies the
technical conditions imposed on the set of basic acts F, so that the
derived problem is also an objectivistic decision problem. In fact, if
every f ∈ F is expressible in the form Σf(r)f_r (with the usual condition
on the f(r)’s), primary acts for F(x) analogous to the f_r’s can be defined
by attaching to every function r = r(x) an element f(x; r) of F(x),
where

(2)  $f(\mathbf x; \mathbf r) =_{Df} f_{\mathbf r(\mathbf x)}.$

There are only a finite number of f(x; r)’s, and all elements of F(x) are
expressible as weighted averages of them; the first assertion is obvious,
and the second poses the problem of finding, for any system of
probability measures φ(r; x) on the r’s, at least one probability measure on
the set of functions r with respect to which P(r(x) = r) = φ(r; x) for
every r and x. The problem typically has many solutions; the simplest
is to let the r(x)’s, regarded for each x as functions of r, be independent
random variables on the set of r’s considered as a probability space,
that is, to set $P(\mathbf r) = \prod_x \phi(\mathbf r(x); x)$.

Formally, this particular solution leads to the identity

(3)  $f(x) = \sum_r \phi(r; x) f_r$
     $= \sum_{\mathbf r} \{\prod_{x'} \phi(\mathbf r(x'); x')\} f_{\mathbf r(x)}.$

The identity and the fact that the coefficients in braces are non-negative
and add up to 1, are easy to check analytically, if it is recognized
that summation with respect to r means multiple summation with
respect to r(1), r(2), ··· (the x’s being for definiteness supposed to take
integral values). Equation (3) shows incidentally that it is immaterial
whether it is before or after the observation that mixed acts are
introduced.
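
The product construction just described can be checked by enumeration in a small case. The sketch below is illustrative only: the two values of x, the two primary indices r, the weights `phi`, and the function names are assumptions made here. It builds the product measure on the set of functions r and verifies that its marginal at each x reproduces the prescribed weights, which is the property needed for the identity (3).

```python
from fractions import Fraction
from itertools import product

def product_measure(phi, xs, rs):
    """Probability of each function r (a tuple of values, one per x) when the
    r(x)'s are made independent, r(x) having distribution phi[x]."""
    P = {}
    for r_func in product(rs, repeat=len(xs)):
        weight = Fraction(1)
        for x, r in zip(xs, r_func):
            weight *= phi[x][r]
        P[r_func] = weight
    return P

def marginal(P, xs, x, r):
    """P(r(x) = r) under the measure P on functions."""
    k = xs.index(x)
    return sum(w for r_func, w in P.items() if r_func[k] == r)

xs = [1, 2]
rs = [1, 2]
# Hypothetical mixing weights phi[x][r], one probability measure on the r's per x.
phi = {1: {1: Fraction(1, 3), 2: Fraction(2, 3)},
       2: {1: Fraction(3, 4), 2: Fraction(1, 4)}}

P = product_measure(phi, xs, rs)
print(P)
print(all(marginal(P, xs, x, r) == phi[x][r] for x in xs for r in rs))
```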

Turn momentarily to the idea of observation in group decision
problems. Here the E(f | B_i)’s are replaced by I(f; i)’s, the expected income
of f in the opinion of the ith person. There is no partition B_i, except
in a special, though theoretically important, case, namely that of the
ith person holding unequivocally that B_i obtains.

The P(x | B_i)’s are here replaced by P(x; i)’s, the personal
distribution of x for the ith person. It is postulated that, for each person, the
conditional expectation of f is unaffected by knowledge of x.

The derived acts are formally the same as for an objectivistic decision
problem, and the income function of the derived group decision
problem is

(4)  $I(f(\mathbf x); i) = \sum_x I(f(x); i) P(x; i).$


Returning to objectivistic problems, (9.4.1) defines the loss function
of the basic objectivistic problem and, mutatis mutandis, that of the
derived problem also, thus:

(5)  $L(f(\mathbf x); i) = \max_{f'(\mathbf x)} E(f'(\mathbf x) \mid B_i) - E(f(\mathbf x) \mid B_i).$

The right side of (5) admits some simplification, for, if the person knew
which B_i obtained, observation would be valueless to him. Accordingly,

(6)  $L(f(\mathbf x); i) = \max_{f'} E(f' \mid B_i) - E(f(\mathbf x) \mid B_i).$

Analytically, the simplification is justified thus:

(7)  $\max_f E(f \mid B_i) \le \max_{f(\mathbf x)} E(f(\mathbf x) \mid B_i)$
     $= \max_{f(\mathbf x)} \sum_x E(f(x) \mid B_i) P(x \mid B_i)$
     $\le \max_f E(f \mid B_i).$

The bilinear games associated with the primary and derived problems are,
respectively,

(8)  $L(f; \beta) = l(\beta) - E(f \mid \beta),$

(9)  $L(f(\mathbf x); \beta) = l(\beta) - E(f(\mathbf x) \mid \beta)$
     $= l(\beta) - \sum_i \sum_x E(f(x) \mid B_i) P(x \mid B_i) \beta(i)$
     $= l(\beta) - \sum_x E(f(x) \mid \beta, x) P(x \mid \beta).$

If necessary, (9) can be interpreted and verified by comparison with
(7.3.7) and (7.2.8), in that order.

In Chapter 7, β(i) was generally required not only to be non-negative,
but also strictly positive; on examination, this slight difference from
the present context will be found innocuous. Again, in Chapter 7, the
statement and derivation of conclusions were, for simplicity, nominally
confined to twofold partition problems. Here the extension of those
conclusions to n-fold problems will be freely used, though some readers
may prefer here, as there, to focus on twofold problems.

Letting L* denote the minimax (and maximin) value of the basic,
and L*(x) that of the derived problem, it is obvious, since F(x) ⊃ F,
that L*(x) ≤ L*; but there is some interest in viewing this inequality
as a consequence of (7.3.4):

(10)  $L^*(\mathbf x) = \max_\beta \min_{f(\mathbf x)} L(f(\mathbf x); \beta)$
      $= \max_\beta [l(\beta) - v(F(\mathbf x) \mid \beta)]$
      $\le \max_\beta [l(\beta) - v(F \mid \beta)]$
      $= \max_\beta \min_f L(f; \beta) = L^*.$

It is clear that the maximin β’s for the basic and derived problems are
the β’s that maximize the concave functions

(11)  $h(\beta) =_{Df} l(\beta) - v(F \mid \beta) = l(\beta) - k(\beta)$

and

(12)  $h(\beta; \mathbf x) =_{Df} l(\beta) - v(F(\mathbf x) \mid \beta) = l(\beta) - E(k(\beta(\mathbf x)) \mid \beta),$

respectively. The search for minimax f(x)’s, for example, is greatly
narrowed by the consideration that, if f(x) is minimax, E(f(x) | β) =
v(F(x) | β) for some β, indeed for every maximin β. According to § 7.3,
equality obtains in (10), if and only if there is a maximin β₀ of the
basic problem such that

(13)  $\beta_0(i \mid x) =_{Df} \frac{P(x \mid B_i)\beta_0(i)}{\sum_j P(x \mid B_j)\beta_0(j)}$

is also a maximin of the basic problem for every x such that

      $\sum_j P(x \mid B_j)\beta_0(j) > 0.$

The most typical possibility, and the only one to be explored here, is
that the basic problem has a unique maximin β₀ with β₀(j) > 0 for all
j. Under this assumption, L*(x) = L*, if and only if x is utterly
irrelevant, as is easily shown.

In the same spirit, as can easily be shown, L*(x) = 0, if x is
definitive, but not typically otherwise; and, if x extends y, then L*(x) ≤ L*(y)
with equality if, and typically only if, y is sufficient for x.
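
The comparison of the minimax value without and with an observation can be illustrated in the simplest twofold case. The sketch below rests on assumptions made here for illustration: two states, two basic acts with loss |r − i|, a hypothetical two-valued observation with the probabilities shown, and invented function names. For such a problem the least loss against β is the lower envelope of finitely many straight lines in b = β(1), so its maximum can be found exactly among the end points and the crossing points of the lines.

```python
from fractions import Fraction
from itertools import product

def envelope_lines(P1, P2):
    """Lines (slope, intercept) in b = beta(1) whose lower envelope is the
    least loss against beta, for two states and loss |r - i|.

    P1[k], P2[k] are the probabilities of the k-th observed value under B1, B2.
    After that value, act 1 costs P2[k] * (1 - b) and act 2 costs P1[k] * b.
    Each line corresponds to one rule assigning an act to every observed value."""
    lines = []
    for rule in product((1, 2), repeat=len(P1)):
        slope = sum(-P2[k] if act == 1 else P1[k] for k, act in enumerate(rule))
        intercept = sum(P2[k] if act == 1 else 0 for k, act in enumerate(rule))
        lines.append((slope, intercept))
    return lines

def envelope(lines, b):
    return min(s * b + c for s, c in lines)

def maximin(lines):
    """Maximum over 0 <= b <= 1 of the envelope; returns (value, b)."""
    candidates = {Fraction(0), Fraction(1)}
    for (s1, c1), (s2, c2) in product(lines, repeat=2):
        if s1 != s2:
            b = Fraction(c2 - c1) / Fraction(s1 - s2)
            if 0 <= b <= 1:
                candidates.add(b)
    return max((envelope(lines, b), b) for b in candidates)

# No observation is treated as an observation with a single possible value.
no_observation = envelope_lines([Fraction(1)], [Fraction(1)])
# Hypothetical observation: P(1 | B1) = 4/5, P(1 | B2) = 3/10.
observation = envelope_lines([Fraction(4, 5), Fraction(1, 5)],
                             [Fraction(3, 10), Fraction(7, 10)])

L_star, beta_basic = maximin(no_observation)
L_star_x, beta_derived = maximin(observation)
print(L_star, beta_basic, L_star_x, beta_derived)
```

In this instance the observation lowers the minimax value and also moves the maximin β, in keeping with the remark that equality in (10) is exceptional.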


3 Sufficient statistics 


Digressing from the minimax rule for a moment, something more
fundamental can be said about a sufficient statistic y of x. Namely, for
every f(x) ∈ F(x), there exists an f(y) ∈ F(y) such that I(f(y); i) =
I(f(x); i) for every i. Indeed f(y) = Σ f(x)P(x | y) defines such an
act. Without appeal to so weak a step as the minimax rule, this
remark demonstrates that even an objectivist loses nothing by exchanging
knowledge of an observation for knowledge of a sufficient statistic
of it. The remark might as well have been expressed in § 7.4, except
that there it would have involved some circumlocution, mixed acts not
yet having been introduced.


4 Simple dichotomy, an example 


Much of what has been said thus far is well illustrated by the
minimax counterpart of Exercise 7.5.2. The reader is accordingly asked to
review that exercise and continue it thus:

Exercises

1. For the problem in question:

(a)  $h(\beta) = \delta_2\beta(1) + \delta_1\beta(2) - |\delta_1\beta(2) - \delta_2\beta(1)|.$

(b)  $h(\beta; \mathbf x) = \delta_2\beta(1) + \delta_1\beta(2) - \sum_r |\delta_1 r_2 \beta(2) - \delta_2 r_1 \beta(1)| \{\sum_j P(r \mid B_j)\}$
     $= \delta_2[2P(r_1 < r_1^*(\beta, \beta_0) \mid B_1) + P(r_1 = r_1^*(\beta, \beta_0) \mid B_1)]\beta(1)$
     $+ \delta_1[2P(r_2 < r_2^*(\beta, \beta_0) \mid B_2) + P(r_2 = r_2^*(\beta, \beta_0) \mid B_2)]\beta(2).$

2a. A β is maximin, if and only if r*(β, β₀) is such that

(1)  $\delta_2 P(r_1 < r_1^*(\beta, \beta_0) \mid B_1) \le \delta_1 P(r_2 \le r_2^*(\beta, \beta_0) \mid B_2)$

and

(2)  $\delta_2 P(r_1 \le r_1^*(\beta, \beta_0) \mid B_1) \ge \delta_1 P(r_2 < r_2^*(\beta, \beta_0) \mid B_2).$

2b. There is typically only one maximin, but there may be a closed
interval of them.


3. Though the acts of F and F(x) as defined by Exercise 7.5.2 do not
provide for mixed acts, it will suffice to consider mixtures of the f(x)’s.
Each of these will be determined by an i, and nothing will be lost by
requiring i to be of the form i(r(x)).

4a. Any minimax will be equivalent to a mixture of f(x)’s each
corresponding to a likelihood-ratio test associated with r*(β, β₀) for every
maximin β.

4b. In view of Exercise 3, the only likelihood-ratio tests that need
be considered for a maximin β are:

i(r) = 1, if and only if r₁ < r₁*(β, β₀).
i(r) = 1, if and only if r₁ ≤ r₁*(β, β₀).

These are not necessarily different tests.

5a. If the maximin β is unique, the minimax act is unique (except
possibly for equivalent acts) and is a mixture of exactly two f(x)’s
corresponding to the two likelihood-ratio tests defined in Exercise 4b.

This conclusion calls for some comment, for, in ordinary statistical
practice, one or the other of the extreme likelihood-ratio tests is used,
never a mixture. This practice is not in serious conflict with the
minimax rule, because the maximum loss associated with either extreme is
typically only slightly greater than L*(x). Moreover, vagueness about
the exact magnitude of δ₁ and δ₂ would usually frustrate any attempt
to calculate the coefficients of the mixture. Incidentally, mixture is
not called for at all when r is continuously distributed, for h(β; x) is
then smooth rather than polygonal; that is, if P(r = r’ | B_i) = 0 for
every r’ and both i’s, then h(β; x) has a continuous first derivative in β.
To show this and to show that the derivative is δ₂P(r₁ < r₁* | B₁) −
δ₁P(r₂ < r₂* | B₂) may be taken as an exercise only slightly beyond the
usual mathematical level of this book.

5b. If there is more than one maximin β, then any one that is not
extreme has only one likelihood-ratio test associated with it, and the
same one for all. The f(x) corresponding to that test is essentially the
only minimax.


5 The approach to certainty


In concluding the paraphrase of § 7.1-6 that has thus far been the
subject of the present chapter, it should be mentioned that the approach
to certainty studied in § 7.6 obviously implies that the corresponding
L*(x(n)) approaches zero with increasing n.


6 Cost of observation 


A cost c associated with an objectivistic observational problem
diminishes the income by E(c | B_i) for each i, regardless of f; that is,
allowing for the cost, I(f; i) = E(f − c | B_i). But the cost, being
unavoidable, does not affect the loss function, so the minimax problem
associated with the observation is independent of the cost. The costs
do intervene, however, in an essential way in the problem of deciding
which to choose of several available observations, say x_a at cost c_a; it
is important to bear in mind in connection with this problem that a null
observation at zero cost is typically among the choices available in real
life. The generic act of this compound problem can conveniently be
symbolized by Σλ(a)f(x_a), or sometimes simply by λ. Here, of course,
λ(a) ≥ 0, Σλ(a) = 1; for choice of λ means choice, for each a, of the
probability λ(a) that the ath observation x_a will be made and also choice
of the derived act f(x_a) to be adopted in case x_a is made. It is intuitively
evident, and follows easily from (1) below, that the mixture of several
λ’s is also a λ as far as income is concerned, so mixtures of λ’s do not
require explicit consideration. The income function can be written

(1)  $I(\lambda; i) = \sum_a \lambda(a) E(f(\mathbf x_a) - c_a \mid B_i).$

Whence

(2)  $\max_\lambda I(\lambda; i) = \max_f E(f \mid B_i) - \min_a E(c_a \mid B_i).$

The loss function is accordingly

(3)  $L(\lambda; \beta) = \sum_a \lambda(a) \{L_a(f(\mathbf x_a); \beta) + d_a(\beta)\},$

where

(4)  $d_a(\beta) =_{Df} \sum_i \{E(c_a \mid B_i) - \min_{a'} E(c_{a'} \mid B_i)\}\beta(i),$

and L_a(f(x_a); β) is the loss function of the observational problem
derived from the ath observation.

The compound minimax problem is intimately related to the concave
functions h(β; x_a) and the linear functions d_a(β), as is explained by the
following exercises.
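
The interplay between the concave functions and the costs can be seen in a small computed case. The sketch below is offered only as an illustration under several assumptions made here: two states, two basic acts with loss |r − i|, a free null observation, one hypothetical informative observation at a constant cost of 1/10, and invented function names. It also takes for granted that the least loss against β in the compound problem is the smallest of the sums h(β; x_a) + d_a(β), which is the content of the first exercise below.

```python
from fractions import Fraction
from itertools import product

def envelope_lines(P1, P2, cost=0):
    """Lines in b = beta(1) whose lower envelope is h(beta; x_a) + d_a,
    for two states, loss |r - i|, and a constant cost d_a = cost."""
    lines = []
    for rule in product((1, 2), repeat=len(P1)):
        slope = sum(-P2[k] if act == 1 else P1[k] for k, act in enumerate(rule))
        intercept = sum(P2[k] if act == 1 else 0 for k, act in enumerate(rule))
        lines.append((slope, intercept + cost))
    return lines

def maximin(lines):
    """Maximum over 0 <= b <= 1 of the lower envelope of the lines."""
    candidates = {Fraction(0), Fraction(1)}
    for (s1, c1), (s2, c2) in product(lines, repeat=2):
        if s1 != s2:
            b = Fraction(c2 - c1) / Fraction(s1 - s2)
            if 0 <= b <= 1:
                candidates.add(b)
    return max(min(s * b + c for s, c in lines) for b in candidates)

# Hypothetical choices: (probabilities under B1, probabilities under B2, cost).
choices = {
    "null": ([Fraction(1)], [Fraction(1)], Fraction(0)),
    "x": ([Fraction(4, 5), Fraction(1, 5)],
          [Fraction(3, 10), Fraction(7, 10)], Fraction(1, 10)),
}

# Minimax value of each observation taken alone, cost included.
separate = {a: maximin(envelope_lines(*spec)) for a, spec in choices.items()}
# Compound problem: lower envelope over all observations at once.
all_lines = [line for spec in choices.values() for line in envelope_lines(*spec)]
compound_value = maximin(all_lines)
print(separate, compound_value, min(separate.values()))
```

Here the compound value falls strictly below the better of the two separate values, so a person free to randomize between observing and not observing does better, in the minimax sense, than with either course alone.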
Exercises

1. Show that

(5)  $h_\lambda(\beta) =_{Df} \min_\lambda L(\lambda; \beta) = \min_a [h(\beta; x_a) + d_a(\beta)]$.

2. If $\lambda = 1 \cdot f'(x_{a'})$, then $L(\lambda; \beta) = h_\lambda(\beta)$;
if and only if: first,

(6)  $L_{a'}(f'(x_{a'}); \beta) = h(\beta; x_{a'})$

(in which case $f'(x_{a'})$ will be called well adapted to $x_{a'}$ and
$\beta$); and, second,

(7)  $h(\beta; x_{a'}) + d_{a'}(\beta) = \min_a [h(\beta; x_a) + d_a(\beta)]$

(in which case $x_{a'}$ will be called well adapted to $\beta$).

3a. Show that

(8)  $L_\lambda^* =_{Df} \min_\lambda \max_\beta L(\lambda; \beta) = \max_\beta h_\lambda(\beta)$
          $\le \min_a \max_\beta [h(\beta; x_a) + d_a(\beta)]$.

3b. Under the important special condition that the $d_a(\beta)$ are
equal to constants $d_a$, (8) specializes to

(9)  $L_\lambda^* \le \min_a [L^*(x_a) + d_a]$.

3c. When can equality hold in (8) and (9)?

3d. $\beta'$ is maximin, if and only if $h_\lambda(\beta') = L_\lambda^*$.

4. $\lambda = \sum \lambda(a) f(x_a)$ is minimax, if and only if:

($\alpha$) For every $a$ for which $\lambda(a) > 0$, $x_a$ is well
adapted to every maximin $\beta$, and $f(x_a)$ is well adapted to $x_a$
and every maximin $\beta$.

($\beta$) $L(\lambda; i) \le L_\lambda^*$ for every $i$. (Of course
($\beta$) is alone necessary and sufficient; the point of the exercise
is that the necessary condition ($\alpha$) may conveniently confine the
search for minimax $\lambda$'s to relatively few candidates.)

5. Suppose that: ($\alpha$) $r$ and $i$ are confined to the values 1
and 2, and $L(f_r; i) = |r - i|$; ($\beta$) $x$ is confined to the
values 1 and 2, and $P(1 \mid B_1) = 1/2$, $P(1 \mid B_2) = 1/4$;
($\gamma$) $a$ is confined to the values 1 and 2, and the $\lambda$'s
of the compound problem attach weight $\lambda(1)$ to a basic act at
zero cost and $\lambda(2)$ to an act derived from $x$ at a non-negative
constant cost $d$. Compute and graph: $h(\beta)$, $h(\beta; x)$, and
(for various values of $d$) $h_\lambda(\beta)$. Graph $L_\lambda^*$ as
a function of $d$, and discuss the minimax $\lambda$'s for various
values of $d$.


7 Sequential probability ratio procedures

The type of decision problem that in § 7.7 led to the concept of a
sequential probability ratio procedure has an intimate counterpart in
an important type of compound objectivistic decision problem, for
which the concept was in fact originally developed by Wald [W2].
The $x_a$'s of a problem of this type range over the enormous variety of
sequential observational programs associated with a sequence of
(conditionally) identically distributed random variables
$x(1), x(2), \ldots$. The technical assumption that the $a$'s have a
finite range is not fulfilled; but, as in § 7.7, I proceed with some
lapse of rigor, referring to Wald's book [W3] or [A7] for the full
details. Exercise 6.4 shows that attention may be confined to $a$'s
that are well adapted to at least one $\beta$, and that for those $a$'s
it may be confined to $f(x_a)$'s that are well adapted to $x_a$ and the
corresponding $\beta$. The way is paved by § 7.7, which states sharply
restrictive properties of the $x_a$'s and $f(x_a)$'s that are so
adapted. In some cases, recognition of these properties contributes
greatly to the possibility of actually computing minimax, or nearly
minimax, procedures for sequential problems.


8 Randomization 


Another important type of compound problem is illustrated by the
second example of § 9.6. A generalization of part of that example is
presented here to show how the minimax rule explains, or implies, the
process called randomization, which is one of the most striking features
of modern statistics, and one long antedating the minimax rule.
Randomization represents the only important use of mixed acts that has
thus far found favor with practicing statisticians, as will be discussed
in the next section. The exact meaning of randomization seems a little
elusive; no sharp definition is attempted here. But, roughly,
randomization is the selection of an observation at random; that is, of
a $\lambda$ with more than one $\lambda(a)$ actually positive, the
choice of the $\lambda(a)$'s and of the derived acts being governed
largely by symmetry. The following example provides at least a fairly
general illustration of the concept.

To set the stage and provide motivation for a formal statement, the
example will first be stated in language that is suggestive though a
little vague. The consequences of the basic acts in the example depend
on the composition of a population of $n$ objects, which may be thought
of as numbered from 1 through $n$. It may be known of some compositions
that they cannot occur; but, if a composition is considered possible,
all populations having that composition (irrespective of ordering) are
also considered possible. Each observation in the compound problem
consists in the cost-free observation of some $m$ of the objects, every
subset of exactly $m$ objects being available for observation.

Formally, the index $i$ of the partition $B_i$ runs over a certain set
$I$ of $n$-tuples, $\{i_1, \ldots, i_n\}$, of elements considered for
definiteness to be integers. If $i = \{i_1, \ldots, i_n\} \in I$, then
any permutation $Ti$ of $i$ is also in $I$. It is assumed that

(1)  $E(f \mid B_i) = E(f \mid B_{Ti})$

for every $f \in F$, $i \in I$, and permutation $T$.

To every subset $A$ of $m$ integers,
$1 \le a_1(A) < a_2(A) < \cdots < a_{m-1}(A) < a_m(A) \le n$, there
corresponds an observation $x(A)$ the possible values of which are
$m$-tuples $\{x_1(A), \ldots, x_m(A)\}$. The conditional distributions
of the $x(A)$'s are defined thus: If $x_1(A) = i_{a_1(A)}$, etc., then
$P(x_1(A), \ldots, x_m(A) \mid B_i) = 1$.

It is obvious that $L^*(x(A))$ is the same for every $A$. In typical
applications this common value is little, if at all, less than $L^*$.

If a compound act $\sum \lambda(A) f(x(A))$ is to be chosen,
statistical common sense asserts that nothing is to be lost by:

(a) Letting $\lambda(A)$ be independent of $A$, and therefore equal to
$\binom{n}{m}^{-1}$ for every $A$; that is, letting every sample of
size $m$ have the same probability of being chosen, or randomizing, as
it is said.

(b) Letting $f(x_1(A), \ldots, x_m(A))$ be symmetric in its $m$
arguments and independent of $A$.
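Prescription (a) is easily mechanized. The following is a minimal sketch, with hypothetical function names, that lists the equal weights attached to the samples of size $m$ and draws one such sample by means of Python's standard pseudo-random generator, which here stands in for a table of random numbers.

```python
import math
import random
from fractions import Fraction
from itertools import combinations

def randomizing_weights(n, m):
    """Equal weight 1 / C(n, m) for every subset A of m of the n objects."""
    weight = Fraction(1, math.comb(n, m))
    return {A: weight for A in combinations(range(1, n + 1), m)}

def draw_sample(n, m, rng):
    """Draw one subset of size m, every subset being equally likely."""
    return tuple(sorted(rng.sample(range(1, n + 1), m)))

if __name__ == "__main__":
    weights = randomizing_weights(5, 2)
    print(len(weights), "samples, each with weight", weights[(1, 2)])
    print("drawn sample:", draw_sample(5, 2, random.Random()))
```

A derived act obeying (b) would then be any function of the drawn values that ignores both their order and the identity of the subset drawn.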


It can in fact be shown, by the method illustrated in the second
example of § 9.6 and discussed more generally in § 12.5, that there is
at least one minimax satisfying (a) and (b), and even that there is an
admissible one. Typically, if $m$ is large, but small compared to $n$,
$L_\lambda^*$ is much smaller than the common value of the
$L^*(x(A))$'s.

The importance of randomization in applied statistics can scarcely
be exaggerated. From the personalistic viewpoint it is one of the most
important ways to bring groups of people into virtual unanimity; from
the objectivistic viewpoint it not only makes possible great reductions
in maximum loss, but it is seen as an invention by which the theory of
probability is brought to bear on situations to which probability on
first (objectivistic) sight would seem irrelevant.


9 Mixed acts in statistics 


Many have commented that modern applied statistics makes one, but only
one, important use of mixed acts, namely in deciding, through the
process of randomization, what to observe. Thus, for example, once the
observation has been made, the derived act is in practice almost always
chosen, without mixing, from a set of basic acts natural to the
problem. This might seem to imply a sharp conflict between the minimax
rule and ordinary statistical practice; but actually it reflects
agreement, for mixed acts greatly reduce the minimax loss in
decision-problem interpretations of typical practical statistical
situations, when and only when ordinary practice calls for mixed acts
of the same sort, namely when randomization is called for.

There are certain mechanisms that systematically tend to make mixed
acts have relatively little, or even absolutely no, advantage over
unmixed acts. In the following discussion of these mechanisms, let
$L(r; i)$ be the abstract game on which a bilinear game $L(f; g)$ is
based.

In the first place, supposing that $L(r; i)$ is non-negative for every
$r$ and $i$ (as is appropriate to the context now at hand), (12.3.6)
can be completed, so to speak, thus:

(1)  $L^* \min(R, I) \ge \min_r \max_i L(r; i)$,

where $R$ and $I$ denote for the moment the number of values of $r$ and
$i$ respectively, and $\min(R, I)$ is of course the minimum of the two
integers $R$ and $I$. An inequality stronger than (1) will actually be
proved.

Consider a minimax $f$ for which the smallest possible number $R'$ of
the $f(r)$'s are actually positive:

(2)  $R' L^* = \max_i R' \sum_r L(r; i) f(r)$
          $\ge \max_i L(r'; i)$
          $\ge \min_r \max_i L(r; i)$,

where $r'$ is so chosen that $R' f(r') \ge 1$, as can obviously be
done. It is known [B19] that $R' \le \min(R, I)$.

The important lesson of (1) is that, unless $R$ and $I$ are both large,
the introduction of mixed acts cannot reduce the minimax loss to a
very small fraction of the value it would otherwise have.

To mention a different mechanism, Figure 12.4.1 suggests that, if
there are many $i$'s, the corners of the concave function emphasized in
that figure may well be very blunt, in which case a minimax mixed act
has almost as high a maximum loss as any one of its components. When
the number of $i$'s is infinite, the concave function may well be
differentiable, in which case mixed acts have absolutely no advantage.
The remark appended to Exercise 4.5a is pertinent here.

This mechanism can be related to a certain large class of infinite
abstract (i.e., not necessarily bilinear) games, discovered by Kakutani
[K1], for which $L^* = L_*$. Bilinear games are but a special case of
these, and numerous others seem to arise frequently in applications.
If $L^* = L_*$ for an abstract game, nothing at all can be gained by
adjoining mixed acts, as (12.3.5) shows.
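Inequality (1) of this section can be checked on small examples. The sketch below is an illustration under stated assumptions, not a general solver: it treats only abstract games with $R = 2$ and non-negative losses, finding the mixed minimax loss exactly by examining the end points and the crossing points of the linear functions of the mixing probability. The function names are hypothetical.

```python
from fractions import Fraction as F
from itertools import combinations

def pure_minimax(loss):
    """min over r of max over i of L(r; i)."""
    return min(max(row) for row in loss)

def mixed_minimax_two_rows(loss):
    """Exact minimax loss over mixtures p * row 0 + (1 - p) * row 1."""
    top, bottom = loss
    cols = range(len(top))

    def worst(p):
        return max(p * top[i] + (1 - p) * bottom[i] for i in cols)

    # The worst-case loss is convex and piecewise linear in p, so its
    # minimum lies at an end point or where two columns cross.
    candidates = {F(0), F(1)}
    for i, j in combinations(cols, 2):
        denom = (top[i] - bottom[i]) - (top[j] - bottom[j])
        if denom != 0:
            p = F(bottom[j] - bottom[i], denom)
            if 0 <= p <= 1:
                candidates.add(p)
    return min(worst(p) for p in candidates)

if __name__ == "__main__":
    for loss in ([[0, 1], [1, 0]], [[0, 2, 1], [3, 0, 1]]):
        R, I = len(loss), len(loss[0])
        mixed, pure = mixed_minimax_two_rows(loss), pure_minimax(loss)
        print(loss, "mixed:", mixed, "pure:", pure,
              "(1) holds:", mixed * min(R, I) >= pure)
```

In the first game the bound is attained: mixing halves the minimax loss, which is the most that (1) allows when $\min(R, I) = 2$.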

Finally, it may be mentioned that in many cases where an observation
$x$ might be followed by a mixed derived act, the same, or nearly the
same, consequences can often be realized by a pure act. Speaking a
little loosely, this occurs whenever $x$ has a continuous or nearly
continuous contraction $y$ that is irrelevant, or nearly irrelevant,
for then $y$ can play the role in selecting a basic derived act that
would otherwise be assigned to a table of random numbers. If, for
example, $x$ is continuous, $y(x)$ can be taken as the last few digits
in the decimal expansion of $x$ to an extravagant number of places.
Again if, conditionally, $x = \{x_1, \ldots, x_n\}$ is an $n$-tuple of
continuously, identically, and independently distributed real random
variables, $y(x)$ may be taken as the permutation that ranks the
$x_i$'s in ascending order, provided that $n!$ is fairly large: $10!$
should satisfy almost any need.
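The second device can be sketched in a few lines. Under the stated assumption that the $x_i$'s are continuously, identically, and independently distributed, each of the $n!$ rankings is equally probable, so the ranking can be converted to an integer and used where a random number would otherwise be consulted. The names below, and the use of the Lehmer code to number the permutations, are illustrative choices.

```python
from math import factorial

def rank_permutation(xs):
    """Indices of the observations, listed in ascending order of value."""
    return tuple(sorted(range(len(xs)), key=lambda k: xs[k]))

def permutation_index(perm):
    """Number the permutation from 0 to n! - 1 by its Lehmer code."""
    n = len(perm)
    index = 0
    for k, value in enumerate(perm):
        smaller_later = sum(1 for later in perm[k + 1:] if later < value)
        index += smaller_later * factorial(n - 1 - k)
    return index

def pick_act(xs, acts):
    """Select a basic act from the ranking alone.

    The selection is exactly uniform only if len(acts) divides n!.
    """
    return acts[permutation_index(rank_permutation(xs)) % len(acts)]

if __name__ == "__main__":
    observed = [0.3, 0.1, 0.2]
    print(rank_permutation(observed), pick_act(observed, ["a", "b", "c"]))
```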

A recent technical reference on the superfluousness of mixed acts in
the presence of continuous observations is [D13].

I have occasionally heard it conjectured that any mixed act made
after the observation (in an observational decision problem) is wrong in
principle. I would argue that the conjecture is mistaken thus: Any
observational problem that calls for randomization can be simulated, so
far as its loss function $L(r; i)$ is concerned, by a basic problem. A
mixed act will be as appropriate to the basic problem as it was to the
observational problem from which the basic one was derived. In this
way a great variety of situations calling for mixed acts having nothing
to do with choice of observation can be constructed, though they seem
to be atypical in practice. Moreover, any basic problem can [...] occur
as the decision problem remaining after some particular value of an
observation has been observed, so the situations just [...] lead to
closely related ones calling for mixed acts after [...].

Less abstractly, consider a person choosing from a [...] of French
pastries. Even after extensive visual observation and [...]
interrogation of the waiter, the person might justifiably introduce
[...] mixture into his choice.

[...] conjecture that mixed acts are inappropriate after observations
stems partly from the mechanisms that in typical cases tend to make
such acts inappropriate, and partly from justifiable dissatisfaction
with [...] mixed acts that have [...]. For example, the suggestion that
ties in [...] tests [...] the tied observations at random may in many,
or perhaps all, cases fairly be regarded with suspicion.


CHAPTER 15 


Point Estimation 


1 Introduction 


This chapter discusses point estimation, and the next two discuss the
testing of hypotheses and interval estimation, respectively. Definitions
of these processes must be sought in due course; but, for the moment,
whatever notions about them you happen to have will afford sufficient
background for certain introductory remarks applying equally well to
both kinds of estimation and to testing.

Estimating and testing have been, and inertia alone would insure
that they will long continue to be, cornerstones of practical statistics.
Their development has until recently been almost exclusively in the
verbalistic tradition, or outlook. For example, testing and interval
estimation have often been expressed as problems of making assertions,
on the basis of evidence, according to systems that lead, with high
probability, to true assertions, and point estimation has even been
decried as ill-conceived because it is not so expressible.

Wald's minimax theory has, as was explained in § 9.2, stimulated
interest in the interpretation of problems of estimation and testing
[...] interpretation [...]. For reasons discussed in [...]

The task of any such interpretation from one framework of ideas to
another is necessarily delicate. In the present instance, there is
particular temptation to force the interpretation [...], for these are
[...] maintained [...]. Of course it is to [...] of this chapter and
the next demonstrate [...] interpretations do often translate verbalistic
criteria into applications of the behavioralistic ones. In evaluating any
such interpretations, it must be borne in mind that an analogy of great 
mathematical value may be valueless as an interpretation; correspond- 
ingly, what is put forward as mere analogy should not be taken to be 
an interpretation, much less branded as a forced one. For example, 
attention has already been called (in § 11.4) to the danger of regarding 
the analogy between the theory of two-person games and that of the 
minimax rule for objectivistic decision problems as an interpretation. 
In fact, minimax problems are of such mathematical generality that 
they arise, even within statistics, in contexts other than direct applica- 
tion of the minimax rule to objectivistic decision problems; a striking, 
though technical, example is Theorem 2.26 of Wald’s book [W3]. 

The literature of estimation and testing is vast; indeed it has, I 
think, been seriously contended that statistics treats of no other sub- 
jects. This chapter and the next two cannot, therefore, pretend to 
present a complete digest of that literature, even so far as it pertains to 
the foundations of statistics. For further reading certain chapters of 
Kendall’s treatise [K2] may be recommended as a key reference to the 
verbalistic tradition (Chapters 17 and 18 for point estimation; 19 and 
20 for interval estimation; 21, 26, and 27 for testing). Many newer 
aspects are treated in Wald’s book [W3]; and a recent review of testing 
by Lehmann [L4] is recommended. 


2 The verbalistic concept of point estimation 


Abstractly and very generally, but in verbalistic language (which is
necessarily vague), the problem of point estimation is this: Knowing
$P(x \mid B_i)$ for every $i$ and having observed the value $x$, guess
the value $\lambda$ of a prescribed function, or parameter as it is
often called, $\lambda(i)$ with values in a set $\Lambda$.
Semi-behavioralistically this is, I think universally, understood to
mean that a function $l$ associating a value $l(x) \in \Lambda$ with
each $x$ (or possibly a mixture of such functions) is to be decided on,
the function $l$ being called an estimate (or, to be complete, a point
estimate) of the parameter $\lambda$. A problem of point estimation
has, thus, some of the structure of an objectivistic observational
problem; but, since nothing has yet been said about the income, or
consequence, resulting from the act $l$ in case $B_i$ obtains, it is at
the moment impossible to advance criteria for the choice of $l$.


3 Examples of problems of point estimation 

It will now be well to present some examples after a few words of
preparation. For simplicity, $\Lambda$ will henceforth generally be
supposed to be an interval (possibly unbounded) of real numbers. If
$\lambda(i) = \lambda(i')$ implies $i = i'$, then $\lambda$ rather than
$i$ can be used to index the partition; such an estimation problem is
said to be free of nuisance parameters. This usage corresponds to the
fact that the $i$'s can typically be represented as ordered couples
$(\lambda, \theta)$, where $\lambda$ is of course $\lambda(i)$ and
$\theta$ is called the nuisance parameter; if $\theta$ in turn happens
to be represented as an ordered $n$-tuple, ordinary usage calls
$\theta$ an $n$-tuple of nuisance parameters. It must be recognized as
atypical in estimation problems for $i$ or $\lambda$ to be confined to
a finite set of values, and often $x$ is not so confined either. It
will therefore be necessary to proceed heuristically into domains where
the mathematically limited theory developed in this book does not
rigorously apply.

The specific estimation problems most commonly cited as examples,
and most important in practice, are summarized in Table 1, together
with their maximum-likelihood estimates, that is, estimates constructed
in accordance with a rule to be defined in § 4. All but the last two
examples of Table 1 are free of nuisance parameters.


4 Criteria that have been proposed for point estimates 


As a matter of fact, verbalistic treatments typically do give some
inkling of the consequence of the act $l$ when $B_i$ obtains. Thus, in
the examples commonly cited, such as those in Table 3.1, $\Lambda$ is a
set of real numbers or a set of $n$-tuples of real numbers and,
therefore, a set of objects between which the notion of proximity has
some meaning. Work in the verbalistic tradition has made it clear in
connection with such examples that, if $l = \lambda(i)$ for the $B_i$
that obtains, the guess is considered perfect and that, roughly
speaking, it is considered rather poor if $l$ is far from $\lambda$.

In spite of the apparently hopeless indefiniteness of estimation
problems even as thus formulated, various criteria, or desiderata, for
estimates have been suggested. A list of these criteria, intended to be
essentially complete, is now presented. Each item is annotated and
illustrated to make its meaning clear, and sometimes to call attention
to related criteria not explicitly listed; motivation and criticism are,
however, deferred until later sections, where they are treated in
connection with explicit hypotheses about the consequences of
misestimation.

No attempt is made to include criteria like intellectual simplicity or
facility of computation that depend not only on the estimate but also
on the capabilities of the people who contemplate using it. The list
is in a sense logically inhomogeneous. For example, no one really
considers it a virtue in itself for an estimate to be a
maximum-likelihood estimate (Criterion 4); rather, it is believed that
such estimates do typically have real virtues.

It has, to begin the list of criteria, been suggested by one person or
another that:


1. If $y$ is sufficient, nothing is to be lost by requiring the
estimate $l$ to be a contraction of $y$.


It will be instructive to bear in mind that necessary and sufficient
statistics of the examples (a)-(f) in Table 3.1 are, respectively, $x$,
$x$, $\bar{x}$, $\sum x_i^2$, $(\bar{x}, \sum x_i^2)$,
$(\bar{x}, \sum x_i^2)$.
2. If, of two estimates $l$ and $l'$,

(1)  $E([l - \lambda(i)]^2 \mid B_i) \le E([l' - \lambda(i)]^2 \mid B_i)$

for every $i$, with strict inequality for some $i$, then $l$ is better
than $l'$.


There are countless variants of this idea. In particular, the square
of the difference may be replaced by any other positive power of the
absolute difference. Again, (1) may be imposed at only one value of $i$
if $l$ and $l'$ are subjected to some other condition, freedom from
bias (Criterion 6 below) being the popular one.

Example (f) gives rise to a good illustration of this criterion, which
is also interesting in a later connection. Letting
$Q =_{Df} \sum x_i^2 - n\bar{x}^2$, it is well known that
$E(Q \mid \mu, \sigma^2) = (n - 1)\sigma^2$ and that
$E(Q^2 \mid \mu, \sigma^2) = (n^2 - 1)\sigma^4$. Therefore

(2)  $E([\alpha Q - \sigma^2]^2 \mid \mu, \sigma^2) = \{\alpha^2 (n^2 - 1) - 2\alpha (n - 1) + 1\} \sigma^4$
          $= \{(n^2 - 1)(\alpha - \frac{1}{n + 1})^2 + \frac{2}{n + 1}\} \sigma^4$
          $\ge \frac{2\sigma^4}{n + 1}$

for all real $\alpha$, with equality if and only if
$\alpha = (n + 1)^{-1}$, omitting the pathological but trivial case
that $n = 1$. By the criterion in question $Q/(n + 1)$ is therefore
better than any other estimate of the form $\alpha Q$, including the
maximum-likelihood estimate $Q/n$ and the unbiased estimate
$Q/(n - 1)$.
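The comparison of the three multiples of $Q$ can be verified by direct arithmetic. The sketch below simply evaluates the first line of (2), in units of $\sigma^4$, at $\alpha = 1/(n+1)$, $1/n$, and $1/(n-1)$; the function name and the choice $n = 10$ are illustrative assumptions.

```python
from fractions import Fraction as F

def scaled_mse(alpha, n):
    """E([alpha*Q - sigma^2]^2) divided by sigma^4, from formula (2)."""
    return alpha ** 2 * (n ** 2 - 1) - 2 * alpha * (n - 1) + 1

if __name__ == "__main__":
    n = 10
    for label, alpha in (("Q/(n+1)", F(1, n + 1)),
                         ("Q/n, maximum likelihood", F(1, n)),
                         ("Q/(n-1), unbiased", F(1, n - 1))):
        print(label, scaled_mse(alpha, n))
```

For every $n > 1$ the three values are ordered as the text asserts, the first equalling $2/(n + 1)$.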


3. If, of two estimates $l$ and $l'$,

(3)  $P(-\epsilon_1 \le l(x) - \lambda(i) \le \epsilon_2 \mid B_i) \ge P(-\epsilon_1 \le l'(x) - \lambda(i) \le \epsilon_2 \mid B_i)$

for every non-negative $\epsilon_1$ and $\epsilon_2$ and for every $i$,
with strict inequality for some $\epsilon_1$, $\epsilon_2$, and some
$i$, then $l$ is better than $l'$.


Acceptance of this criterion is obviously implied by acceptance of
Criterion 2, of which it may therefore be regarded as a skeptical
counterpart; formal demonstration of a much more general assertion will
be given in connection with (5.2-4). The criterion implies, for
example, in connection with (c) of Table 3.1 that $\bar{x}$ is superior
to any other weighted average of the $x_i$'s. A more interesting
example will be mentioned in connection with Criterion 5.

That modification of Criterion 3 in which it is concluded only that
$l$ is at least as good as $l'$ is of some technical interest.
Incidentally, if equality held identically in (3), there would
presumably be nothing to choose between the two estimates by any
reasonable criterion, for they would then both have the same system of
conditional distributions.


4. A maximum-likelihood estimate is often a rather good estimate.


A maximum-likelihood estimate is an estimate $l$ such that, for some
function $i$ of $x$, $l(x) = \lambda(i(x))$ and

(4)  $P(x \mid B_{i(x)}) \ge P(x \mid B_i)$

for every $i$ and $x$. In many natural problems there is only one
maximum-likelihood estimate. Taking into account the analogy between
probabilities and values of probability densities, the reader should
verify that the estimates listed in Table 3.1 are indeed the unique
maximum-likelihood estimates of the problems to which they refer. When
there is a unique maximum-likelihood estimate, it is obviously a
contraction of the likelihood ratios and, therefore, of any sufficient
statistic; which fits neatly with Criterion 1.
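Condition (4) can be illustrated numerically for the binomial problem. The sketch below assumes that example (a) is the estimation of $p$ from a binomially distributed $x$ out of $n$ trials, for which the familiar maximum-likelihood estimate is $x/n$, and searches a finite grid of values of $p$ rather than the whole interval. The grid and the function names are illustrative assumptions.

```python
from fractions import Fraction as F
from math import comb

def binomial_likelihood(x, n, p):
    """P(x | B_p) for x successes in n trials with success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def grid_mle(x, n, grid=100):
    """Value of p on the grid k/grid that maximizes the likelihood of x."""
    candidates = [F(k, grid) for k in range(grid + 1)]
    return max(candidates, key=lambda p: binomial_likelihood(x, n, p))

if __name__ == "__main__":
    for x in (0, 3, 10):
        print("x =", x, " grid maximum-likelihood estimate of p:", grid_mle(x, 10))
```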


5. A good estimate should have the same symmetry as the problem.
More precisely, if a permutation $T$ of the $x$'s and the $i$'s is such
that

(5)  $P(Tx \mid B_{Ti}) = P(x \mid B_i)$,

and such that $\lambda(i) = \lambda(i')$ implies
$\lambda(Ti) = \lambda(Ti')$; then $l$ should be such that, if
$l(x) = \lambda(i)$, $l(Tx) = \lambda(Ti)$.


For example, adopting also Criterion 1, a good estimate for $\mu$ in
(c) may be sought of the form $l(\bar{x})$. Symmetry then dictates
$l(\bar{x} + a) = l(\bar{x}) + a$ and $l(-\bar{x}) = -l(\bar{x})$; in
short, $l(\bar{x}) = \bar{x}$.

The same conclusion can be drawn for (e), though with a little more
trouble. The criterion applied to (f) leads to estimates of the form
$\alpha Q$. The constant $\alpha$ might be fixed by appealing, for
example, to Criterion 2, 4, or 6. These alone give three slightly
different determinations: $\alpha^{-1} = (n + 1)$, $n$, and $(n - 1)$,
respectively.

Again, it can be shown for Examples (c) and (e) that, among all
estimates satisfying Criterion 5, $\bar{x}$ is best according to
Criterion 3.


6. It is desirable that the estimate be unbiased.
An estimate $l$ is called unbiased, if and only if

(6)  $E(l \mid B_i) = \lambda(i)$

for every $i$.


It is easy to verify that the maximum-likelihood estimates of (a)-(e)
in Table 3.1 are all unbiased; that of (f), however, is not, for
$E(Q/n \mid \mu, \sigma^2) = (1 - 1/n)\sigma^2$ instead of $\sigma^2$.
Again, if $l$ is a maximum-likelihood estimate of $\lambda$, $e^l$ is a
maximum-likelihood estimate of $e^\lambda$. But, if $l$ is not
definitive, and $l$ is an unbiased estimate of $\lambda$, $e^l$ is not
an unbiased estimate of $e^\lambda$, as Theorem 1 of Appendix 2 implies.

7. If $P(|l - \lambda(i)| < |l' - \lambda(i)| \mid B_i) > 1/2$ for
every $i$, then $l$ is better than $l'$.


Any resemblance between this criterion and Criterion 3 seems to be
dispelled by the following example. Suppose that, for every $i$,
$P(l - \lambda(i) = a, l' - \lambda(i) = b \mid B_i)$ equals 2/11 if
$a$ and $b$ are integers such that $0 \le a < b \le 2$, equals 5/11 if
$a$ and $b$ are 2 and 0 respectively, and equals 0 otherwise. According
to Criterion 7, $l$ is better than $l'$, because 6/11 > 1/2; but,
according to Criterion 3, $l'$ is better than $l$, because 5/11 > 4/11
and 7/11 > 6/11. The example can easily be modified to suit any taste
for symmetry and continuity. But, if $l$ and $l'$ are conditionally
independent (which is not a natural assumption), and $l$ is better
than $l'$ according to Criterion 7; then, as may easily be shown, $l'$
cannot be better than $l$ by Criterion 3.
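The probabilities quoted in this example can be checked by enumeration. The following sketch tabulates the joint distribution of the two errors and evaluates the quantity of Criterion 7 together with the interval probabilities relevant to Criterion 3; since neither error is ever negative, only the upper limit of the interval matters. The names are illustrative.

```python
from fractions import Fraction as F

# Joint distribution of the errors (a, b) = (l - lambda, l' - lambda).
JOINT = {(0, 1): F(2, 11), (0, 2): F(2, 11), (1, 2): F(2, 11),
         (2, 0): F(5, 11)}

def prob(event):
    """Probability of the set of error pairs (a, b) satisfying event."""
    return sum(p for (a, b), p in JOINT.items() if event(a, b))

# Criterion 7: probability that l is strictly closer than l'.
CLOSER = prob(lambda a, b: abs(a) < abs(b))

def within(eps, which):
    """P(error <= eps) for l (which = 0) or for l' (which = 1)."""
    return prob(lambda a, b: (a, b)[which] <= eps)

if __name__ == "__main__":
    print("P(l closer than l') =", CLOSER)
    for eps in (0, 1):
        print("eps =", eps, " l:", within(eps, 0), " l':", within(eps, 1))
```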
The list of criteria is here interrupted by several paragraphs of
explanation in preparation for two concluding criteria.

The approach to certainty treated in §§ 3.6 and 7.6 has its
counterpart in the theory of estimation. In particular, if
$x(n) = \{x_1, \ldots, x_n\}$ is an $n$-tuple of conditionally
independent and identically distributed observations, there will
typically exist sequences of estimates $l(n)$ based on $x(n)$, such
that

(7)  $\lim_{n \to \infty} P(|l(x(n); n) - \lambda(i)| < \epsilon \mid B_i) = 1$

for every positive $\epsilon$ and every $i$. A sequence of estimates
satisfying (7) relative to any sequence of observations $x(n)$ (not
necessarily $n$-tuples of conditionally independent observations) is
called consistent.

The condition of consistency is often realized in a very special way,
namely that the error $[l(x(n); n) - \lambda(i)]$ is, for every $B_i$
and for large $n$, practically normally distributed about zero with
variance inversely proportional to $n$. More formally, a sequence of
estimates may be such that

(8)  $\lim_{n \to \infty} P(n^{1/2}[l(x(n); n) - \lambda(i)] \le a \mid B_i) = (2\pi)^{-1/2} \sigma^{-1}(i) \int_{-\infty}^{a} e^{-z^2 / 2\sigma^2(i)} \, dz$

for every $i$ and $a$, where $\sigma(i)$ is some positive function of
$i$; it is then said that $n^{1/2}[l(x(n); n) - \lambda(i)]$ is
asymptotically normal about zero with asymptotic variance
$\sigma^2(i)$. If, in addition, for every $i$, $\sigma^{-2}(i)$ is not
less than a certain function, the differential information, to be
defined in § 6, then the sequence $l(n)$ is called efficient.

There is a possible pitfall in connection with the idea of asymptotic
normality. Though (8) implies that, for large $n$, the distribution of
the error is, in a sense, almost the normal distribution with zero mean
and variance $\sigma^2(i)/n$, it does not imply that the mean of the
error is close to zero, or even finite or well defined. Similarly, the
variance of the error may be much larger than $\sigma^2(i)/n$,
infinite, or ill defined; but it cannot, for large $n$, be smaller than
$\sigma^2(i)/n$ by a fixed fraction or less.

Much literature on estimation has concentrated on sequences of
estimation problems in which $x(n)$ is an $n$-tuple consisting of the
first $n$ elements of an infinite sequence of conditionally independent
and conditionally identically distributed random variables or, as it
will be called in the present chapter, a standard sequence; because
these are the simplest examples of sequences of increasingly
informative observations. Examples (c)-(f) in Table 3.1 refer directly
to standard sequences; the binomial distributions (a) can be regarded
as the distribution of the sufficient statistic $\sum x_i$ of the
standard sequence $x(n)$ in which each $x_i$ takes the values 1 and 0
with probabilities $p$ and $1 - p$, respectively (cf. Exercise 7.4.1);
again, if each $x_i$ is Poisson-distributed with parameter $\mu$, then
$\sum x_i$ is sufficient for $x(n)$ and is itself Poisson-distributed
with parameter $n\mu$. Thus, all the examples in Table 3.1 give rise
more or less directly to examples of standard sequences.

In speaking of standard, and occasionally of other, sequences the
ellipsis of referring to a sequence of estimates simply as "an estimate"
has been widely adopted, so one reads recommendations that "an
estimate" should be consistent or efficient. This ellipsis, though
often convenient, sometimes proves dangerous. It distracts from the
fact that a person is called upon to make an estimate, not a sequence
of estimates; so that the question of what constitutes a good sequence
does not arise. Again, it makes one feel that if an estimate, say
$l_{13}$, has been defined for $x(13)$, then the definition of $l_{14}$
is thereby implied. One forgets, for example, that "the" average of $n$
observations is a whole sequence of statistics, a sequence singled out
by human tastes and interests, rather than by any mathematical
necessity. In short, the ellipsis establishes the atmosphere of the
logically nonsensical (though perhaps psychologically revealing)
questions on intelligence tests such as "What are the two missing terms
in the sequence ___ 18281828?"

The recommendations of consistency and efficiency quoted above can
be added to the numbered list of suggestions, in a form that avoids the
ellipsis:


8. If each $l(n)$ is a good estimate for the corresponding $x(n)$ of a
standard sequence, then the sequence $l(n)$ is consistent.


The sequence of maximum-likelihood estimates of the sequences of
problems (a), (c)-(f) are consistent; and, for the sequence of problems
of estimating from an observation $y_n$ Poisson-distributed with
parameter $n\mu$, the maximum-likelihood estimates $y_n/n$ are
consistent. [...] for example, be multiplied by $(1 + n^{-1/2})$
without destroying consistency. Again, the sample medians are in (c) a
consistent sequence different from the sequence of maximum-likelihood
estimates.


9. Under the hypothesis of Criterion 8, the sequence $l(n)$ is
efficient, at least if any efficient sequence of estimates exists.


[...] are. The asymptotic variances and certain other interesting
quantities associated with these six sequences are presented in
Table 1. [...] the expected [...] approach the asymptotic variance of
$n^{1/2}$ times the error. [...] five examples the relations mentioned
hold, indeed, [...] limit, but exactly, for all $n$. All six examples
are rather special, or magical, but the limiting relations just
mentioned may fairly [...] generality, though they are not (as has
[...]) [...] maximum-likelihood estimate of $|\mu|^{-1}$ for
$\mu \ne 0$; this sequence of estimates is efficient; and
$n^{1/2}(|\bar{x}|^{-1} - |\mu|^{-1})$ is asymptotically normal about
zero with asymptotic variance $\mu^{-4}$; but the other three entries
for Table 1 are infinite in this example.


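TABLE 1. EXAMPLES OF BEHAVIOR OF MAXIMUM-LIKELIHOOD ESTIMATES

Sequence          Mean                 n x variance          n x expected          Asymptotic variance
                                                             square of error       of n^{1/2} x error

(a)               $p$                  $pq$                  $pq$                  $pq$
Poisson ($n\mu$)  $\mu$                $\mu$                 $\mu$                 $\mu$
(c)               $\mu$                $1$                   $1$                   $1$
(d)               $\sigma^2$           $2\sigma^4$           $2\sigma^4$           $2\sigma^4$
(e)               $\mu$                $\sigma^2$            $\sigma^2$            $\sigma^2$
(f)               $(1 - 1/n)\sigma^2$  $2(1 - 1/n)\sigma^4$  $(2 - 1/n)\sigma^4$   $2\sigma^4$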


As in the case of consistency, where there is one efficient sequence,
there are many, but efficiency is, of course, a much more restrictive
property than consistency. For example, multiplication by
$(1 + n^{-1/2})$ typically destroys efficiency, though multiplication
by $(1 + n^{-1})$ never does. Again, the consistent sequence of medians
mentioned under Criterion 8 is not efficient. Indeed, it is well known
of that sequence that the sequence of errors times $n^{1/2}$ is
asymptotically normal about zero with asymptotic variance $\pi/2$
rather than 1.
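The contrast between the mean and the median in example (c) can be seen in a rough simulation. The sketch below assumes normal observations with mean zero and unit variance, a moderate sample size, and Python's standard pseudo-random generator; it estimates the expected square of $n^{1/2}$ times the error for each estimate. The function name, sample size, and number of replications are arbitrary illustrative choices, and the figures obtained are subject to sampling fluctuation.

```python
import math
import random
import statistics

def scaled_error_variance(estimator, n, replications, rng):
    """Average of (n^{1/2} * error)^2 over simulated samples, true mean 0."""
    errors = []
    for _ in range(replications):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        errors.append(math.sqrt(n) * estimator(sample))
    # With mu=0.0 this is the mean square of the scaled errors about zero.
    return statistics.pvariance(errors, mu=0.0)

if __name__ == "__main__":
    rng = random.Random(0)
    v_mean = scaled_error_variance(statistics.fmean, 101, 4000, rng)
    v_median = scaled_error_variance(statistics.median, 101, 4000, rng)
    print("mean:  ", round(v_mean, 3), "(limit 1)")
    print("median:", round(v_median, 3), "(limit pi/2 =", round(math.pi / 2, 3), ")")
```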


5 A behavioralistic review of the criteria for point estimation 


It is time now to introduce the notion of consequences, or (equivalently,
I believe) of loss, thereby interpreting estimation problems as
decision problems. Let it be said then that an estimation decision problem
is an observational decision problem with the following distinguishing
feature. There is a one-to-one correspondence between the basic
acts f and the values attained by a real-valued function $\lambda(i)$, such that
$L(f; i) = 0$, if f is the act that corresponds with $\lambda(i)$. It is simpler,
more suggestive, and harmless to let the number $l$ that corresponds to
f replace f itself in all further discussion of estimation decision problems.
To illustrate the new notation, it may be said that $L(l; i) = 0$, if $l = \lambda(i)$.

I believe that any situation ordinarily said to call for (point) estimation
can be analyzed as an estimation decision problem. For example,
estimating how much paint will cover a wall may, depending on circumstances,
mean deciding: how much paint to buy, what to bid for a
contract, or what number to enter in a guessing pool. Under each of
those interpretations there will be zero loss, if and, typically, only if
the estimate is “correct,” as one says.

The consequences of an estimate may, like those of many real life 
decisions, be difficult to appraise. It is hard to say even in relatively 
concrete situations what it will cost to misestimate the speed of light, 
a particular mortality rate, or the national income. If, to revert to an 
example already discussed, the estimate is to be published somewhere 
for the use of whoever has a use for it, the consequences of publication 


may seem beyond all reckoning. None the less, I reaffirm the convic- 
tion that the concept of consequence measured in income or loss is 
valuable in dealing with such situations, 
ment of estimation will illustrate. 


admissibility and the minimax rule to such classes of estimation de- 
cision problems. 


(1)    $L(l; i) \le L(l'; i)$
for $\lambda(i) \le l \le l'$ and for $\lambda(i) \ge l \ge l'$.


Situations to which (1) fails
to apply can readily be imagined. William Tell, for example, in estimating
the angle by which to elevate his cross-bow for the apple shot
might have preferred a downward error of 10° to one of 1°; but such
circumstances seem exceptional. Furthermore, it is usually justifiable
to assume that strict inequality holds in (1), though there are many
exceptions in which, for example, “a miss is as good as a mile” or one
hit is as good as another.

As is, I think, intuitively evident, when strict inequality holds in 
(1), Criterion 3 is simply an application of the principle of admissibility. 
That conclusion can be shown in complete generality without serious 
difficulty, but, in compliance with the usual mathematical limitations 
of this book, it will here be shown only under the assumption that x 
is confined to a finite number of values. 

What is to be shown is this: If $\mathbf{l}$ and $\mathbf{l}'$ are a pair of estimates satisfying
the hypothesis of Criterion 3, and if (1) holds with strict inequality;
then $L(\mathbf{l}; i) - L(\mathbf{l}'; i) \le 0$ for every $i$, with strict inequality for some
$i$. To begin the proof calculate thus:


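(2)    $L(\mathbf{l}; i) - L(\mathbf{l}'; i) = \sum_{l} L(l; i) \{P(\mathbf{l} = l \mid B_i) - P(\mathbf{l}' = l \mid B_i)\}$

        $= \sum_{l} L(l; i)\, Q(l; i)$

        $= \sum_{l < \lambda(i)} L(l; i)\, Q(l; i) + \sum_{l > \lambda(i)} L(l; i)\, Q(l; i),$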


where the definition of $Q(l; i)$ is clear from the context, and where it
has been taken into account that $L(\lambda(i); i) = 0$. It will be shown that
both sums in the last part of (2) are non-positive and that for some $i$ at
least one of them is negative. Focus, for definiteness, on the second
sum. Let $l_0 = \lambda(i)$ and $l_1, l_2, \cdots$ be, in order of increasing magnitude,
the values of $l > \lambda(i)$ for which $Q(l; i) \neq 0$. With the abbreviations
$L(k) =_{\mathrm{Df}} L(l_k; i)$, $\Delta(k) =_{\mathrm{Df}} L(k) - L(k - 1)$, and $Q(k) =_{\mathrm{Df}} Q(l_k; i)$,
the sum to be investigated is

(3)    $\sum_{0 < k} L(k)\, Q(k) = \sum_{0 < k} Q(k) \sum_{0 < k' \le k} \Delta(k')$

        $= \sum_{0 < k'} \Delta(k') \sum_{k \ge k'} Q(k).$


(This rearrangement may seem bizarre on first encounter, but it is
widely used in mathematics generally and is in fact an exact analogue,
for sums, of the more familiar integration by parts, for integrals.) It
follows from (1) read with strict inequality that $\Delta(k) > 0$; and it follows
from the hypothesis of Criterion 3 that $\sum_{k \ge k'} Q(k) \le 0$, and that some
such sum (or an analogous term associated with the first sum in the last
line of (2)) is strictly negative for some $i$. This completes the deduction
of Criterion 3 from the strict form of (1) and the principle of admissibility.
Essentially the same argument leads from (1) as actually
written to the modification mentioned in the note under Criterion 3.
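The rearrangement in (3) is summation by parts, and it can be checked on a small example. The Python sketch below is merely illustrative: the particular losses `L` and differences `Q` are hypothetical numbers chosen so that the losses increase with $k$ and every tail sum of the $Q(k)$ is non-positive, even though one individual $Q(k)$ is positive.

```python
from fractions import Fraction


def direct_sum(L, Q):
    """Left side of (3): sum over k > 0 of L(k) Q(k)."""
    return sum(L[k] * Q[k] for k in range(1, len(L)))


def rearranged_sum(L, Q):
    """Right side of (3): sum over k' > 0 of Delta(k') times the tail sum of Q from k'."""
    total = Fraction(0)
    for k_prime in range(1, len(L)):
        delta = L[k_prime] - L[k_prime - 1]                 # Delta(k') = L(k') - L(k' - 1)
        tail = sum(Q[k] for k in range(k_prime, len(Q)))    # sum of Q(k) for k >= k'
        total += delta * tail
    return total


# Hypothetical values; index 0 stands for l_0 = lambda(i), where the loss is zero.
L = [Fraction(0), Fraction(1, 2), Fraction(2), Fraction(5)]
Q = [Fraction(0), Fraction(1, 10), Fraction(-1, 5), Fraction(-1, 10)]  # Q[0] is unused

print(direct_sum(L, Q), rearranged_sum(L, Q))
```

Both forms give the same negative value. The second form makes the sign evident, since each increment of loss is positive and each tail sum is non-positive.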

A very slight strengthening of (1), together with the minimax rule,
provides a widely applicable justification of Criterion 8 (consistency),
as will now be explained. Suppose that (1) not only holds but also is
strict, if $l = \lambda(i)$; that is, in addition to (1) suppose only that $L(l'; i) > 0$
for all $l' \neq \lambda(i)$. In this context, let $\mathbf{x}(n)$ be a sequence of observations
such that the minimax $L^*(n)$ of the corresponding estimation
problems approaches zero with increasing $n$; then any sequence of minimax
estimates $\mathbf{l}(n)$ is consistent. Indeed, if the sequence $\mathbf{l}(n)$ is not
consistent, then, for some $i$, and some positive $\epsilon$ and $\delta$,

(4)    $P(|\mathbf{l}(n) - \lambda(i)| \ge \epsilon \mid B_i) \ge \delta$

for some arbitrarily large values of $n$. This implies

(5)    $L^*(n) \ge L(\mathbf{l}(n); i) \ge \delta \min \{L(\lambda(i) + \epsilon; i), L(\lambda(i) - \epsilon; i)\} > 0,$

which contradicts the hypothesis.

Turn next to Criterion 5 (symmetry). Suppose that the estimation 
decision problem has symmetry in the sense defined under Criterion 5. 
That does not in itself really call fi 
But, if L also has the symmetry, that is, if L(A’); 7) = L(MT%’'); Ti) 
for all appropriate T, then the di 
gests that typically there is, 
minimax estimate. Whether L h 


near $\lambda(i)$. This condition,
void of content as it may seem to a reader brought up in the tradition
that it makes no practical difference whether a function has a few sharp
corners because they can always be rounded off with almost no change
in the function. If, for example, $L(l; i)$ is for all practicable purposes
equal to $|l - \lambda|$; then $L$ cannot be regarded as di
once when $l = \lambda$, and the theory to be developed here for twice differentiable
$L(l; i)$'s in the presence of extens;
It will therefore be useful to digress,
illustrating how corners can arise and the phenomena that tend to round
them off.

Suppose that a person must estimate the amount of shelving for
books, priced at $1.00 per foot, to be ordered for some purpose. It is
possible that the following economic analysis of the situation would be
sufficiently realistic. The person holds every foot of shelving less than
the number of feet, $\lambda$, of books to be shelved to be worth $\alpha$ dollars, $\alpha > 1$,
but superfluous shelving he holds to be worthless. Formally,

(6)    $L(l; \lambda) = (\alpha - 1)(\lambda - l)$    for $l \le \lambda$
                     $= (l - \lambda)$    for $l \ge \lambda$.

There is then a corner, or kink, at $l = \lambda$; so differentiation, even once, is
impossible.

But the following analysis is much more likely to be sufficiently realistic.
The urgency of the shelving of the books is variable. Some would
be worth shelving, even if the cost of shelving were very high; at the
other extreme, there are some that would not be worth shelving unless
the cost were very low. More fully, the value of $l$ feet of shelving is a
function $i(l)$ that presumably has the following features. It is monotonically
increasing, strictly concave, and twice differentiable in $l$;
$i(0) = 0$; $i(\infty) < \infty$; $i'(0) > 1$. The income attached to ordering $l$
feet of shelving, at the price $1.00 per foot, is clearly

(7)    $I(l; i) = i(l) - l.$

It is maximized at the one and only value $\lambda$ for which $di(\lambda)/d\lambda = 1$, so
that

(8)    $L(l; i) = [i(\lambda) - \lambda] - [i(l) - l],$

which is of course twice differentiable in $l$.

The moral of these two possible economic analyses of one example is 
of wide applicability, as is well known among economists. Where a 
superficial analysis suggests a kink, or even a discontinuity, in an in- 
come function, deeper analysis will often show that the function is 
smoothed out by various economic phenomena such as the inhomo- 
geneity and the mutual substitutability of commodities. 
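The two analyses of the shelving example can be compared numerically. The Python sketch below is an illustration only: the value $\alpha = 3$, the requirement of 10 feet, and the particular value function $i(l) = 30(1 - e^{-l/10})$ are hypothetical choices that merely satisfy the features listed in the text (increasing, strictly concave, $i(0) = 0$, $i(\infty) < \infty$, $i'(0) > 1$).

```python
import math

ALPHA = 3.0   # hypothetical worth, in dollars, of each needed foot of shelving
NEED = 10.0   # hypothetical number of feet, lambda, in the first analysis


def kinked_loss(l, lam=NEED, alpha=ALPHA):
    """Loss (6): linear on each side of lambda, with a corner at l = lambda."""
    return (alpha - 1.0) * (lam - l) if l <= lam else (l - lam)


CAP, SCALE = 30.0, 10.0   # hypothetical parameters of the smooth value function


def value(l):
    """A smooth value function i(l) = CAP * (1 - exp(-l / SCALE))."""
    return CAP * (1.0 - math.exp(-l / SCALE))


# The income i(l) - l is maximized where i'(lambda) = 1.
LAMBDA_SMOOTH = SCALE * math.log(CAP / SCALE)


def smooth_loss(l):
    """Loss (8): best attainable income minus the income actually attained."""
    lam = LAMBDA_SMOOTH
    return (value(lam) - lam) - (value(l) - l)


def one_sided_slopes(loss, lam, h=1e-6):
    """Finite-difference slopes of the loss just below and just above lam."""
    left = (loss(lam) - loss(lam - h)) / h
    right = (loss(lam + h) - loss(lam)) / h
    return left, right


print(one_sided_slopes(kinked_loss, NEED))            # slopes disagree: a kink
print(one_sided_slopes(smooth_loss, LAMBDA_SMOOTH))   # both slopes nearly zero
```

For the kinked loss the two one-sided slopes are about $-(\alpha - 1)$ and $+1$; for the smooth loss both are nearly zero, as they must be at an interior minimum of a differentiable function.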

To return from the digression, if $L$ is twice differentiable in $l$ (at
least when $l$ is close to $\lambda$), $L$ can be expanded in a Taylor series thus:

(9)    $L(l; i) = L(\lambda; i) + (l - \lambda) \left. \frac{\partial}{\partial l} L(l; i) \right|_{l = \lambda(i)} + \frac{1}{2} (l - \lambda)^2 \left. \frac{\partial^2}{\partial l^2} L(l; i) \right|_{l = \lambda(i)} + o((l - \lambda)^2),$

where, following standard usage, $o((l - \lambda)^2)$ is a function of $l$ and $i$, not
necessarily the same from one context to another, such that $o((l - \lambda)^2) \div (l - \lambda)^2$
approaches zero as $l$ approaches $\lambda(i)$ for fixed $i$. The first term
on the right side of (9) vanishes by the definition of estimation; the
second must vanish also, for otherwise $L$ could be negative. Therefore,

(10)    $L(l; i) = \frac{1}{2} (l - \lambda)^2 \left. \frac{\partial^2}{\partial l^2} L(l; i) \right|_{l = \lambda(i)} + o((l - \lambda)^2)$

         $= (l - \lambda(i))^2 a(i) + o((l - \lambda)^2),$

where $a(i)$ is defined by the context.


In view of (10), it is plausible that $L$ may, in many problems where
estimates of great accuracy are possible, be supposed to be practically
of the form

(11)    $L(l; i) = (l - \lambda(i))^2 a(i),$

where $a(i) \ge 0$ for every $i$. This does not exactly mean that a reasonable
$L$ can be closely approximated by functions of the form (11) for
all $l$. In particular, the absurd assumption that $L$ is unbounded (which
such approximation would typically imply) is not to be made. It means,
rather, that under favorable circumstances (11) may lead to a reasonably
good evaluation of $L(\mathbf{l}; i)$. In so far as the form (11) can be supposed
adequately to represent $L$, Criterion 2 is obviously an application
of the principle of admissibility. An interesting discussion and
application of (11) is given by Yates [Y2].


6 A behavioralistic review, continued 


have been discussed in behavioral- 
‘ SPEU) hypotheses, each has been found to 
have considerable behavioralistic justification. Criteria 4 and 9 also 


ones remaining, they do not seem to me to have any serious justifica- 
tion at all, as will be discussed in s 

Criterion 4, the recommendation of maximum-likelihood estimates, is 
of extraordinary interest, for, of all the criteria of the verbalistic tradi- 
tion, it is essentially the only one th 
most every estimation situation of 
section demonstrates that 
maximum-likelihood estim
sumed for mathematical simplicity that each observation under discussion
is confined to a finite number of values, each having positive probability
for every element of whatever partition is under discussion.

If $B_i$ and $B_j$ are elements of a partition, not necessarily finite, and $\mathbf{x}$
is an observation, say, in the spirit of (3.6.11), that the information of
$j$ relative to $i$ for the observation $\mathbf{x}$ is

(1)    $J(i, j; \mathbf{x}) =_{\mathrm{Df}} \sum_x \left( \log \frac{P(x \mid B_i)}{P(x \mid B_j)} \right) P(x \mid B_i) = -E\left( \log \frac{P(\mathbf{x} \mid B_j)}{P(\mathbf{x} \mid B_i)} \,\Big|\, B_i \right).$

The expression of $J$ in terms of likelihood ratios is important, especially
for the extension of the discussion to more general observations than
those contemplated here. The reader should, therefore, try to bear in
mind that the whole discussion could be carried on in terms of likelihood
ratios; I refrain from so doing only for momentary reasons of notational
convenience. The theory of $J$ can conveniently be presented
in a series of exercises.


Exercises 


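1a. If $\mathbf{y}$ is a contraction of $\mathbf{x}$, then $J(i, j; \mathbf{x}) \ge J(i, j; \mathbf{y})$. With equality
when? Hint:

(2)    $E\left( \frac{P(\mathbf{x} \mid B_j)}{P(\mathbf{x} \mid B_i)} \,\Big|\, \mathbf{y}, B_i \right) = \frac{P(\mathbf{y} \mid B_j)}{P(\mathbf{y} \mid B_i)}.$

1b. $J(i, j; \mathbf{x}) \ge 0$. With equality when?

2a. If $\mathbf{x}_1, \cdots, \mathbf{x}_n$ are conditionally independent, then

(3)    $J(i, j; \mathbf{x}_1, \cdots, \mathbf{x}_n) = \sum_r J(i, j; \mathbf{x}_r).$

2b. If in addition the $\mathbf{x}_r$'s are conditionally identically distributed,
then

(4)    $J(i, j; \mathbf{x}_1, \cdots, \mathbf{x}_n) = nJ(i, j; \mathbf{x}_1).$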


It is interesting to evaluate the information $J(\lambda, \lambda + \Delta\lambda; \mathbf{x})$ where $\lambda$
and $\lambda + \Delta\lambda$ are two closely neighboring values of the parameter of an
estimation problem, supposed, for simplicity, to be free of nuisance
parameters. If $P(x \mid \lambda)$ is continuous in $\lambda$, it is almost obvious that
$J(\lambda, \lambda + \Delta\lambda; \mathbf{x})$ approaches zero as $\Delta\lambda$ approaches zero. If $P(x \mid \lambda)$ is
differentiable in $\lambda$, it is easy to show further (considering that $J$ is non-negative)
that even $J(\lambda, \lambda + \Delta\lambda; \mathbf{x})/\Delta\lambda$ approaches zero as $\Delta\lambda$ approaches
zero. But in this case much more can and will be shown,
namely,

(5)    $\lim_{\Delta\lambda \to 0} \frac{J(\lambda, \lambda + \Delta\lambda; \mathbf{x})}{\Delta\lambda^2} = \frac{1}{2} H(\lambda; \mathbf{x}) =_{\mathrm{Df}} \frac{1}{2} E\left( \left[ \frac{\partial \log P(\mathbf{x} \mid \lambda)}{\partial \lambda} \right]^2 \,\Big|\, \lambda \right).$

The function $H$ is generally, following Fisher, called information, but
here we had better call it differential information. Chronologically, as
explained at the end of § 3.6, the concept of differential information is
older than that here called simply information and of which it is, according
to (5), a limiting case.

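The demonstration of (5) begins with the consideration that

(6)    $\log (1 + t) = t - \frac{1}{2} t^2 + o(t^2).$

Therefore,

(7)    $\log \frac{P(x \mid \lambda + \Delta\lambda)}{P(x \mid \lambda)} = \log \left( 1 + \frac{P(x \mid \lambda + \Delta\lambda) - P(x \mid \lambda)}{P(x \mid \lambda)} \right)$

        $= \frac{P(x \mid \lambda + \Delta\lambda) - P(x \mid \lambda)}{P(x \mid \lambda)}$

        $\quad - \frac{1}{2} \left[ \frac{P(x \mid \lambda + \Delta\lambda) - P(x \mid \lambda)}{P(x \mid \lambda)} \right]^2 + o(\Delta\lambda^2).$

Since the expected value (given $\lambda$) of the term in the second line of
(7) is easily seen to be exactly zero, it will be tactful to leave that term
alone; but the squared term may be approximated thus:

(8)    $\frac{1}{2} \left[ \frac{P(x \mid \lambda + \Delta\lambda) - P(x \mid \lambda)}{P(x \mid \lambda)} \right]^2 = \frac{1}{2} \left[ \frac{1}{P(x \mid \lambda)} \frac{\partial P(x \mid \lambda)}{\partial \lambda} \right]^2 \Delta\lambda^2 + o(\Delta\lambda^2)$

        $= \frac{1}{2} \left[ \frac{\partial \log P(x \mid \lambda)}{\partial \lambda} \right]^2 \Delta\lambda^2 + o(\Delta\lambda^2).$

Therefore,

(9)    $J(\lambda, \lambda + \Delta\lambda; \mathbf{x}) = \frac{1}{2} H(\lambda; \mathbf{x}) \Delta\lambda^2 + o(\Delta\lambda^2),$

which establishes (5).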
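The limiting relation (5) is easily watched at work in a simple case. The Python sketch below is an illustration under stated assumptions: the observation is taken to be a single trial with two outcomes, the first having probability $p$, for which the differential information works out to $1/p(1 - p)$. The function names and the value $p = 0.3$ are hypothetical choices.

```python
import math


def relative_information(p, q):
    """J(i, j; x) for one two-valued trial with P(x = 1 | B_i) = p, P(x = 1 | B_j) = q."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def differential_information(p):
    """H(p; x) for the same trial: the expected squared derivative of log P(x | p)."""
    return 1.0 / (p * (1.0 - p))


p = 0.3
steps = (1e-1, 1e-2, 1e-3)
ratios = [relative_information(p, p + d) / d ** 2 for d in steps]
half_H = 0.5 * differential_information(p)

for d, r in zip(steps, ratios):
    print(f"step {d:g}: J / step^2 = {r:.5f}")
print(f"H / 2 = {half_H:.5f}")
```

As the step shrinks, the ratio $J/\Delta\lambda^2$ settles toward $\frac{1}{2}H$, in agreement with (5) and (9).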


More exercises 


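3. If the $k$th derivative ($k > 0$) with respect to $\lambda$ of $P(x \mid \lambda)$ exists
for every $x$, then

(10)    $E\left( \frac{1}{P(\mathbf{x} \mid \lambda)} \frac{\partial^k}{\partial \lambda^k} P(\mathbf{x} \mid \lambda) \,\Big|\, \lambda \right) = \sum_x \frac{\partial^k}{\partial \lambda^k} P(x \mid \lambda) = 0.$

4. If the requisite second derivative exists, then

(11)    $H(\lambda; \mathbf{x}) = -E\left( \frac{\partial^2}{\partial \lambda^2} \log P(\mathbf{x} \mid \lambda) \,\Big|\, \lambda \right).$

5. If $\mathbf{y}$ is a contraction of $\mathbf{x}$ (and $H(\lambda; \mathbf{x})$ is well defined), then
$H(\lambda; \mathbf{y}) \le H(\lambda; \mathbf{x})$.

Remark: The inequality is obvious in the light of Exercise 1a and the
first part of (5). But it can also be derived from the following application
of Theorem 1 of Appendix 2, which is useful in the next exercise.

(12)    $\left[ \frac{1}{P(y \mid \lambda)} \frac{\partial P(y \mid \lambda)}{\partial \lambda} \right]^2 = \left[ E\left( \frac{1}{P(\mathbf{x} \mid \lambda)} \frac{\partial P(\mathbf{x} \mid \lambda)}{\partial \lambda} \,\Big|\, y, \lambda \right) \right]^2 \le E\left( \left[ \frac{1}{P(\mathbf{x} \mid \lambda)} \frac{\partial P(\mathbf{x} \mid \lambda)}{\partial \lambda} \right]^2 \,\Big|\, y, \lambda \right),$

with equality for every $y$ and $\lambda$, if and only if $\frac{\partial}{\partial \lambda} \log P(x \mid \lambda)$ can be expressed
as a function of $y$ and $\lambda$ alone.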

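6a. If $\mathbf{y}$ is a contraction of $\mathbf{x}$, $H(\lambda; \mathbf{x}) = H(\lambda; \mathbf{y})$ for every $\lambda$, if and
only if $\mathbf{y}$ is sufficient for $\mathbf{x}$.

6b. $H(\lambda; \mathbf{x}) = 0$ for every $\lambda$, if and only if $\mathbf{x}$ is utterly irrelevant.

7a. If $\mathbf{x}_1, \cdots, \mathbf{x}_n$ are independent given $\lambda$, then

(13)    $H(\lambda; \mathbf{x}_1, \cdots, \mathbf{x}_n) = \sum_r H(\lambda; \mathbf{x}_r).$

7b. If, in addition, the $\mathbf{x}_r$'s are identically distributed given $\lambda$, then

(14)    $H(\lambda; \mathbf{x}_1, \cdots, \mathbf{x}_n) = nH(\lambda; \mathbf{x}_1).$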


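8. If $\mathbf{l}$ is a real-valued contraction of $\mathbf{x}$, and $H(\lambda; \mathbf{x})$ is well defined,
then

(a)
(15)    $\frac{d}{d\lambda} E(\mathbf{l} \mid \lambda) = E\left( (\mathbf{l} - \lambda) \frac{\partial \log P(\mathbf{x} \mid \lambda)}{\partial \lambda} \,\Big|\, \lambda \right).$

(b)
(16)    $E((\mathbf{l} - \lambda)^2 \mid \lambda)\, H(\lambda; \mathbf{x}) \ge \left[ \frac{d}{d\lambda} E(\mathbf{l} \mid \lambda) \right]^2,$

with equality if and only if

(17)    $\frac{\partial}{\partial \lambda} \log P(x \mid \lambda) = (l(x) - \lambda) k$

for some constant $k$. Hint: Use Exercise 3 and apply the Schwarz inequality
to (15).

(c) If $H(\lambda; \mathbf{x}) > 0$, then

(18)    $E((\mathbf{l} - \lambda)^2 \mid \lambda) \ge \left[ \frac{d}{d\lambda} E(\mathbf{l} \mid \lambda) \right]^2 \Big/ H(\lambda; \mathbf{x}).$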


Exercise 8c is an important, and now famous, inequality. It, together 
with its n-dimensional generalization, has been called the Cramér-Rao 
inequality because of its independent publication by Rao and Cramér 
in 1945 and 1946 respectively (see [H6]). But the name is not at all 
well justified historically. Fréchet presented the inequality in 1943
[F8], and Darmois extended Fréchet’s inequality to $n$ dimensions, at
least for unbiased estimates, in a publication [D1] not later than Rao’s.
The inequality has also, though I think erroneously, been attributed to 
an early paper by Aitken and Silverstone [A1], and to one by Doob 
[D10]. My point is, of course, not to give a definitive history of the in- 
equality, but merely to suggest that for the time being an impersonal 
name would be better. I tentatively propose calling it the information 
inequality. Some recent references pertinent to the information in- 
equality and other topics treated thus far in this section are [W15], 
[M5], [C6], and [H6]. The techniques used in the remainder of this 
section, which revolve around the information inequality, were pub- 
lished posthumously by Wald [W5]. 
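The information inequality of Exercise 8c can be verified exactly in a small case. The Python sketch below is an illustration under stated assumptions: the observation is the number of successes in $n = 25$ independent trials with success probability $p$, for which Exercise 7b gives the differential information $n/p(1 - p)$; the three estimates compared, and all function names, are hypothetical choices. The derivative of the expected value is approximated by a central difference.

```python
import math


def binomial_pmf(n, p):
    """Probabilities of 0, 1, ..., n successes in n independent trials."""
    return [math.comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]


def expected_value(estimate, n, p):
    """E(l | p) for an estimate l(x) of p."""
    return sum(w * estimate(x) for x, w in enumerate(binomial_pmf(n, p)))


def mean_square_error(estimate, n, p):
    """E((l - p)^2 | p)."""
    return sum(w * (estimate(x) - p) ** 2 for x, w in enumerate(binomial_pmf(n, p)))


def information_bound(estimate, n, p, h=1e-6):
    """Right side of the inequality: [d E(l | p) / dp]^2 / H, with H = n / (p (1 - p))."""
    slope = (expected_value(estimate, n, p + h) - expected_value(estimate, n, p - h)) / (2 * h)
    differential_information = n / (p * (1 - p))
    return slope ** 2 / differential_information


N = 25
ESTIMATES = {
    "proportion": lambda x: x / N,
    "shrunk": lambda x: (x + math.sqrt(N) / 2) / (N + math.sqrt(N)),
    "constant": lambda x: 0.5,
}
GRID = [k / 20 for k in range(1, 20)]

# Smallest excess of the mean square error over the bound, for each estimate.
GAPS = {
    name: min(mean_square_error(est, N, p) - information_bound(est, N, p) for p in GRID)
    for name, est in ESTIMATES.items()
}
print(GAPS)
```

For the simple proportion the bound is attained at every $p$; for the other two estimates the mean square error exceeds the bound except where the bias happens to vanish.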


The information inequality has an important bearing on application of
the minimax rule to estimation, of which the following theorem may,
in view of (5.11), be taken as a first illustration.


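THEOREM 1

Hyp. 1. For every $\lambda$ in a closed interval of length $\delta$, $H(\lambda; \mathbf{x}) \le H$,
where $H$ is a constant.

2. $\mathbf{l}$ is a real-valued contraction of $\mathbf{x}$.

Concl. For some $\lambda$ in the interval, $E((\mathbf{l} - \lambda)^2 \mid \lambda) \ge \left( H^{1/2} + \frac{2}{\delta} \right)^{-2}$.

Proof. Suppose that the theorem is false. Then according to Exercise 8c,

(19)    $1 > H^{-1} \left( H^{1/2} + \frac{2}{\delta} \right)^2 \left[ \frac{d}{d\lambda} E(\mathbf{l} \mid \lambda) \right]^2$

for every $\lambda$ in the interval. Therefore,

(20)    $\frac{d}{d\lambda} [\lambda - E(\mathbf{l} \mid \lambda)] > 1 - H^{1/2} \left( H^{1/2} + \frac{2}{\delta} \right)^{-1} = \frac{2}{\delta H^{1/2} + 2}$

for every $\lambda$ in the interval. Therefore, at one end of the interval or
the other,

(21)    $|\lambda - E(\mathbf{l} \mid \lambda)| > \frac{\delta}{\delta H^{1/2} + 2} = \left( H^{1/2} + \frac{2}{\delta} \right)^{-1}.$

This leads to a contradiction through the well-known inequality

(22)    $E((\mathbf{l} - \lambda)^2 \mid \lambda) \ge [E(\mathbf{l} - \lambda \mid \lambda)]^2 = |\lambda - E(\mathbf{l} \mid \lambda)|^2,$

which can be derived as a direct application of Theorem 1 of Appendix
2, or of the Schwarz inequality, or of the useful identity

(23)    $E((\mathbf{l} - \lambda)^2 \mid \lambda) = V(\mathbf{l} \mid \lambda) + [E(\mathbf{l} - \lambda \mid \lambda)]^2.$ ◆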
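The conclusion of Theorem 1 can be tried on a concrete case. The Python sketch below is an illustration under stated assumptions: the observation is the number of successes in 10 independent trials with success probability $p$ confined to the interval [0.4, 0.6], where the differential information $10/p(1 - p)$ is largest at the end points; the three estimates and all names are hypothetical choices. The theorem asserts that no estimate whatever can keep its mean square error below the bound throughout the interval.

```python
import math


def mean_square_error(estimate, n, p):
    """Exact E((l - p)^2 | p) for an estimate l(x) of a binomial parameter p."""
    return sum(
        math.comb(n, x) * p ** x * (1 - p) ** (n - x) * (estimate(x) - p) ** 2
        for x in range(n + 1)
    )


def theorem_1_bound(H_cap, delta):
    """The lower bound (H^(1/2) + 2/delta)^(-2) of Theorem 1."""
    return (math.sqrt(H_cap) + 2.0 / delta) ** -2


N = 10
LOW, HIGH = 0.4, 0.6
DELTA = HIGH - LOW
H_CAP = N / (LOW * (1 - LOW))   # largest value of N / (p (1 - p)) on [LOW, HIGH]
BOUND = theorem_1_bound(H_CAP, DELTA)
GRID = [LOW + DELTA * k / 200 for k in range(201)]

ESTIMATES = {
    "proportion": lambda x: x / N,
    "constant one half": lambda x: 0.5,
    "clamped proportion": lambda x: min(HIGH, max(LOW, x / N)),
}

# Largest mean square error of each estimate over the interval.
WORST = {
    name: max(mean_square_error(est, N, p) for p in GRID)
    for name, est in ESTIMATES.items()
}
print(f"bound: {BOUND:.5f}")
print(WORST)
```

Each estimate, including the constant one that ignores the observation altogether, somewhere in the interval incurs a mean square error at least as large as the bound.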
In the remaining portion of this section, let it be understood that:

1. The $\mathbf{x}_r$'s are an infinite sequence of observations that are, given $\lambda$,
identically distributed and independent.

2. $\mathbf{x}(n) = \{\mathbf{x}_1, \cdots, \mathbf{x}_n\}$ for $n = 1, 2, \cdots$.

3. $\mathbf{l}(n)$ is a real-valued contraction of $\mathbf{x}(n)$.

The contraction $\mathbf{l}(n)$ is to be thought of as an estimate of $\lambda$ based on
observation of $\mathbf{x}(n)$. In the spirit of the minimax theory it is really
mixed, rather than ordinary, estimates that should be treated here.
But this entails no essential change in the following discussion once it
is recognized that a mixed estimate is, in effect, an ordinary estimate
based on observation of $\mathbf{y}(n) =_{\mathrm{Df}} \{\mathbf{l}(n), \mathbf{x}(n)\}$, where $\mathbf{x}(n)$ is sufficient
for $\mathbf{y}(n)$, so that $H(\lambda; \mathbf{y}(n)) = H(\lambda; \mathbf{x}(n))$ for all $\lambda$.

4. $\epsilon$ and $\delta$ are positive numbers.

5. $\Lambda_0$ is a closed interval of length $\delta$ contained in the range of $\lambda$ and
including a given value $\lambda_0$.


The next theorem shows that, if $L(l; \lambda)$ is of the form (5.11), $L(\mathbf{l}(n); \lambda)$
cannot ordinarily be kept much smaller than $a(\lambda_0)/nH(\lambda_0; \mathbf{x}_1)$ for
large $n$, even in a small interval about $\lambda_0$.


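THEOREM 2  If $H(\lambda; \mathbf{x}_1)$ is continuous and positive at $\lambda_0$, and if
$a(\lambda)$ is a non-negative function continuous at $\lambda_0$, then, for sufficiently
large $n$, $E((\mathbf{l}(n) - \lambda)^2 a(\lambda) \mid \lambda) \ge (1 - \epsilon) a(\lambda_0)/nH(\lambda_0; \mathbf{x}_1)$ for some
$\lambda \in \Lambda_0$.


Proof. There is no loss of generality in supposing that $\epsilon < 1$ and
$\Lambda_0$ such that, for $\lambda \in \Lambda_0$, $a(\lambda) \ge a(\lambda_0)(1 - \epsilon)^{1/2}$ and
$H(\lambda; \mathbf{x}_1)^{1/2} \le H(\lambda_0; \mathbf{x}_1)^{1/2} [1 + (1 - \epsilon)^{-1/4}]/2$. Using Exercise 7b,

(24)    $H(\lambda; \mathbf{x}(n))^{1/2} = n^{1/2} H(\lambda; \mathbf{x}_1)^{1/2} \le \frac{1}{2} n^{1/2} H(\lambda_0; \mathbf{x}_1)^{1/2} [1 + (1 - \epsilon)^{-1/4}]$

for $\lambda \in \Lambda_0$. By Theorem 1, if $n \ge 16/\delta^2 H(\lambda_0; \mathbf{x}_1) [(1 - \epsilon)^{-1/4} - 1]^2$,
then

(25)    $E((\mathbf{l}(n) - \lambda)^2 \mid \lambda) \ge \left\{ \frac{1}{2} n^{1/2} H(\lambda_0; \mathbf{x}_1)^{1/2} [1 + (1 - \epsilon)^{-1/4}] + \frac{2}{\delta} \right\}^{-2}$

         $\ge \frac{(1 - \epsilon)^{1/2}}{nH(\lambda_0; \mathbf{x}_1)}$

for some $\lambda \in \Lambda_0$. ◆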


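The next theorem extends Theorem 2 to practically any loss function
that is twice differentiable in $l$ for $l$ and $\lambda$ close to $\lambda_0$.

THEOREM 3

Hyp. 1. $H(\lambda; \mathbf{x}_1)$ is positive and continuous at $\lambda_0$.

2. $a(\lambda) =_{\mathrm{Df}} \frac{1}{2} \left. \frac{\partial^2}{\partial l^2} L(l; \lambda) \right|_{l = \lambda}$ is continuous at $\lambda_0$.

3. Inequality (5.1) holds for $\lambda$ in $\Lambda_0$.

Concl. For sufficiently large $n$, $L(\mathbf{l}(n); \lambda) \ge (1 - \epsilon) a(\lambda_0)/nH(\lambda_0; \mathbf{x}_1)$
for some $\lambda \in \Lambda_0$.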


Proor. It may be Supposed without loss of generality that e < 1; 
and that, for 1, A edo, Ll; A) > (1 — a(l — IN 
It may also be Supposed that I(x; n) 
suffice to prove the theorem for a ne 


defined to be the number in Ay closest to U(x; n), 


from the fact that Lin); A) < L(\(n); d) for x e Ao. 


COROLLARY 1  If $L(l; \lambda)$ satisfies (5.1) and has two derivatives with
respect to $l$ continuous in $\lambda$ for every $\lambda$ and for every $l$ sufficiently close
to $\lambda$, and if $H(\lambda; \mathbf{x}_1)$ is continuous and positive, then, for sufficiently
large $n$,

(26)    $L^*(n) \ge (1 - \epsilon) \sup_{\lambda} a(\lambda)/nH(\lambda; \mathbf{x}_1),$

where L*(n) is the mini 

derived from Lil; X) and x(n), 

finite, in which case nL*(n) 


Of course, it would be enough to assume only that $L(l; \lambda)$ and $H(\lambda; \mathbf{x}_1)$
are well behaved at some sequence of values of $\lambda$ on which the supremum
in question is approached. In particular, if the supremum is actually
attained at some $\lambda$, they need only be well behaved there.

Now, turning to the sequence of maximum-likelihood estimates, let
them be denoted for the moment by $\hat{\mathbf{l}}(n)$. It is known that under
rather general hypotheses $n^{1/2}(\hat{\mathbf{l}}(n) - \lambda)$ is asymptotically normal about
zero with asymptotic variance $1/H(\lambda; \mathbf{x}_1)$.† This suggests, and examples
tend to confirm, that, under some supplementary conditions,

(27)    $\lim_{n \to \infty} nE((\hat{\mathbf{l}}(n) - \lambda)^2 \mid \lambda) = \frac{1}{H(\lambda; \mathbf{x}_1)}.$

Indeed, one set of conditions implying (27) is stated in [W5], but one
that seems difficult to apply. It can be shown that (27), together with
the usual asymptotic behavior of $\hat{\mathbf{l}}(n)$, implies

(28)    $\lim_{n \to \infty} nL(\hat{\mathbf{l}}(n); \lambda) = \frac{a(\lambda)}{H(\lambda; \mathbf{x}_1)},$

provided, for example, that $L(l; \lambda)$ is bounded for each $\lambda$ and that the
second derivative of $L(l; \lambda)$ with respect to $l$ exists when $l = \lambda$. Easily
applied rigorous theorems implying (28), much less (27), do not seem to
have been formulated yet; but examples suggest that, under conditions
general enough for many applications, (28) actually does hold uniformly,
in the sense that, for $n$ sufficiently large,

(29)    $\frac{(1 - \epsilon) a(\lambda)}{nH(\lambda; \mathbf{x}_1)} \le L(\hat{\mathbf{l}}(n); \lambda) \le \frac{(1 + \epsilon) a(\lambda)}{nH(\lambda; \mathbf{x}_1)}$

for all $\lambda$ simultaneously. If (29) holds, then, in view of Corollary 1,
$\hat{\mathbf{l}}(n)$ is nearly minimax for large $n$, in the sense that

(30)    $L^*(n) \ge (1 - \epsilon) \sup_{\lambda} L(\hat{\mathbf{l}}(n); \lambda).$

Good examples can be based on (a) of Tables 3.1 and 4.1, letting
$L(l; p)$ be any loss function having two continuous derivatives in $l$
throughout $0 \le l, p \le 1$. In particular, the example discussed in
§ 13.4 arises, if $L(l; p) = (l - p)^2$. It can be argued that the phenomenon
discussed in connection with that example is probably not rare;
because, for minimax $\mathbf{l}(n)$, $L(\mathbf{l}(n); \lambda)$ is, judging from examples, often
constant and, therefore, nearly equal to $\sup_{\lambda} a(\lambda)/nH(\lambda; \mathbf{x}_1)$, but $L(\hat{\mathbf{l}}(n); \lambda)$
follows the rise and fall of $a(\lambda)/nH(\lambda; \mathbf{x}_1)$.

† Some key references for the asymptotic behavior of $\hat{\mathbf{l}}(n)$ are [K2], [C9], [L3],
[W16], [N4]. The literature on this subject is extraordinarily complicated. There
are acknowledged mathematical mistakes in some of its most sophisticated publications;
others prove much less than any but the most attentive reader would be led
to suppose; few give an adequate statement of their relations to their predecessors;
and those that make serious pretensions to rigor involve complicated hypotheses.
For documentation of this lament see [N4], [W4], and [L3].

Turn now to Criterion 9, efficiency. It seems difficult to defend the
criterion as it has been defined in connection with (4.8); for what virtue
is there in the asymptotic normality required by (4.8)? It is perhaps
noteworthy that the sequence of minimax estimates, $\mathbf{p}(n)$, arising
in connection with § 13.4 does not satisfy (4.8). Indeed, (13.4.3)
implies that $n^{1/2}(\mathbf{p}(n) - p)$ is asymptotically normal not about zero,
but about $(\frac{1}{2} - p)$.

It is my impression that the essence of the efficiency concept resides
not in asymptotic normality, but in the overall behavior of the mean
square error of a sequence of estimates. I therefore propose tentatively
to modify the definition and to call a sequence of estimates $\mathbf{l}(n)$ efficient,
if and only if its mean square error behaves at least as well as
can typically be expected for a sequence of maximum-likelihood estimates.
Formally, I propose to call $\mathbf{l}(n)$ efficient, if and only if, for $n$ sufficiently
large,

(31)    $E((\mathbf{l}(n) - \lambda)^2 \mid \lambda) \le \frac{1 + \epsilon}{nH(\lambda; \mathbf{x}_1)}$

for every $\lambda$ simultaneously.

I think the main objection that i 
definition is associated with the 
theoretical, and perhaps also of practical, importance (31) is not satis- 
fied by any sequence though the maximum- 
likelihood sequence is “official” sense. In such a prob- 
lem, are the maximum-likelihood estimates not as good for all practical 
ough their variances were actually 


ctually no substitute for 
The next paragraph is devoted to an ex- 
ample illustrating the inadequacy of asymptotic variance as a measure 
of asymptotic loss. It can be skipped without loss by anyone not in- 
terested in such technicalities. 

The best example I have been able to construct is derived from a sequence
of observations that is not a standard sequence. Whether the
interesting features that it exhibits can actually be realized by standard
sequences, I do not know; but the example will do to illustrate the issue.
Let $\mathbf{y}(n)$ be any real random variable subject to the density
$n^{1/2}\phi((y - \lambda)n^{1/2}; n)$, defined thus: $\phi(z; n)$ is the standard normal density
inside the interval $[-\delta(n), \delta(n)]$, $\delta(n)$ being such that the standard
normal probability of this interval is (1 — n~); o(z; n) = 2778(2n)/4 
for 6(2n) < | z| < n”; 6(¢; n) is so defined elsewhere as to be a sym- 
metric positive probability density with the first two moments finite, 
with a bounded derivative approaching zero like z~* with increasing z, 
and with unique absolute maximum at $z = 0$. It is evident that $n^{1/2}(\mathbf{y}(n) - \lambda)$
is asymptotically normal about zero with unit variance.
The information $H(\lambda; \mathbf{y}(n))$ is well defined (even according to the strict
conditions imposed by Cramér, Lemma 1, Section 32.2 of [C9]). The
maximum-likelihood estimates of $\lambda$ are $\mathbf{y}(n)$, and these are also (according
to Theorem 3.3 of [G1]) minimax for the simple quadratic loss
function $(l - \lambda)^2$. But


(32) Ellyn) — XP | A) = Ey)? | 0) 


1 

> 2n” f y olyn; n) dy 
5(2n)n~# 

= {nj — 5(2n)n~"4] è(2n), 


which does not satisfy (31). Even for the bounded, and therefore more
realistic, loss function,

(33)    $L(l; \lambda) = \min \{1, (l - \lambda)^2\},$

it follows easily from Theorem 3.3 of [G1] that every estimate must
somewhere incur a loss at least as great as the lower bound established
by (32). To summarize, there are no estimates efficient in the sense
of (31), nor even in the sense that would arise from (31) on replacing
the simple quadratic loss function by a bounded loss function; the sequence
of estimates $\mathbf{y}(n)$ is efficient in the official sense, so to speak,
but does not, of course, result in losses of the order of $n^{-1}$.

What can be said in positive justification of the criterion of efficiency
as defined by (31) or the like? Roughly, the elements of such a sequence
nearly dominate every estimate for every smooth loss function.
A little more precisely, for large $n$, the loss associated with an element
of a sequence efficient in the sense of (31) is at most larger by a small
fraction than that of any other estimate, except possibly in some short
intervals.† The maximum loss of such an element is at most larger by
a small fraction than the minimax loss, so the elements of the sequence
are typically nearly minimax. Moreover, they typically have considerably
smaller losses than any minimax estimate, except in short intervals
that are typically very improbable a priori in the personal sense.
Thus the principle of admissibility, the minimax rule, and the personalistic
concept of probability combine to suggest that efficiency as defined
by (31) is a promising guide in the search for good estimates.

† It has actually been demonstrated that the total length of these exceptional
intervals (within any fixed interval) is small [L3].

An extensive critique of the concept of efficiency, including much 
material on its history, has been given by LeCam in [L3], which unfor- 
tunately was not available to me in its entirety as I wrote this section. 

R. A. Fisher’s name is the most prominent in the history of maximum- 
likelihood estimation and efficiency. Some historical details are given 
in [N4] and on p. 45 of Vol. II of [K2]. 


7 A behavioralistic review, concluded 


Criteria 6 (unbiasedness) and 7 are now the only ones in the list for
which I have not suggested some justification in terms of the theory of
decision problems, and, indeed, I cannot. Unbiased estimates fascinate
many theoretical statisticians, including myself, and the study of them
undoubtedly has certain valuable by-products. Yet it is now widely
agreed that a serious reason to prefer unbiased estimates seems never
to have been proposed.


Three weak defenses are sometimes heard. First, unbiasedness is asserted
to have an intuitive appeal; whether it does or not depends, of
course, on the experience of the intuiter. Second, averages of increasingly
many unbiased estimates are typically consistent. If this is a
virtue, it is a limited one and pertai
ition of other estimates. Third,
: » tor example, it has been agreed that 
one party will buy a sack of sugar from another at so much per pound, 


be determined by un- 


criterion of unbiasedness it should be 
realized that, even if $\lambda$ admits an unbiased estimate, many not-at-all
pathological functions of $\lambda$ (which can in turn be regarded as parameters),
may fail to do so and that such unbiased estimates as $\lambda$ does admit
may be preposterous. These phenomena are both illustrated by the
following simple example. Let $\mathbf{x}$ be confined to two values, say 1 and
2; let $P(1 \mid \lambda) = 1 - P(2 \mid \lambda) = \lambda$; and let $\lambda$ be confined to the interval
[1/3, 2/3]. Then, by definition, $\mathbf{l}$ is an unbiased estimate of $\phi(\lambda)$, if
and only if $l(1)\lambda + l(2)(1 - \lambda) = l(2) + (l(1) - l(2))\lambda = \phi(\lambda)$, a condition
that can be met, if and only if $\phi$ is linear. Suppose, for example,
$\phi(\lambda) = \lambda$ for every $\lambda$, then $l(1) = 1$, $l(2) = 0$ defines the only unbiased
estimate of $\phi(\lambda)$. This estimate is worse, according to an emphatic
variant of Criterion 3, than the biased estimate $\mathbf{l}'$ such that $l'(1) = 2/3$
and $l'(2) = 1/3$; for $\mathbf{l}'$ (when it errs at all) errs in the same direction as
$\mathbf{l}$, but never nearly as far.
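The comparison of the two estimates in this example can be carried out in exact arithmetic. The Python sketch below is merely an illustration of the example as stated; the function names and the grid of eleven values of $\lambda$ between 1/3 and 2/3 are hypothetical choices.

```python
from fractions import Fraction


def expectation(estimate, lam):
    """E(l | lambda) when P(x = 1 | lambda) = lambda and P(x = 2 | lambda) = 1 - lambda."""
    return estimate[1] * lam + estimate[2] * (1 - lam)


def mean_square_error(estimate, lam):
    """E((l - lambda)^2 | lambda)."""
    return (estimate[1] - lam) ** 2 * lam + (estimate[2] - lam) ** 2 * (1 - lam)


unbiased = {1: Fraction(1), 2: Fraction(0)}          # the only unbiased estimate of lambda
biased = {1: Fraction(2, 3), 2: Fraction(1, 3)}      # the biased competitor

# Eleven exact values of lambda running from 1/3 to 2/3.
grid = [Fraction(1, 3) + Fraction(k, 30) for k in range(11)]

for lam in grid:
    print(lam, mean_square_error(unbiased, lam), mean_square_error(biased, lam))
```

At every value of $\lambda$ in the interval, and for either observed value, the error of the biased estimate has the same direction as that of the unbiased one and is smaller in magnitude by exactly 1/3; its mean square error is correspondingly smaller throughout.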

As for Criterion 7, it is on first encounter appealing to postulate that,
if $\mathbf{l}$ is usually closer to $\lambda$ than $\mathbf{l}'$ is, then $\mathbf{l}$ is better than $\mathbf{l}'$. But, speaking
at least for myself, the initial appeal of Criterion 7 seems to have been
bound up with the conjecture that Criterion 7 is in some sense of the
same sort as Criterion 3. The example given under Criterion 7 almost
entirely evaporates the conjecture, and with it the appeal.

In the paper [P5] in which the criterion is put forward for consideration
and exploration, Pitman mentions that the criterion seems acceptable
in contexts where “the devil takes the hindmost.” This allusion
to the devil seems to offer no justification for the criterion as a criterion
of estimation, for I understand the allusion to refer only to the
following kind of decision problem, which is quite remote from estimation
as ordinarily understood and is hardly ever encountered: A person
must choose between $\mathbf{l}$ and $\mathbf{l}'$, winning a prize if the estimate of his
choice falls closer to $\lambda$ than does the other one.

According to Pitman, the relationship of “better than,” or “closer 
than” as he calls it, defined by Criterion 7, is not necessarily transitive. 
He argues, I think with some justice, that this breakdown of transitivity 
does not in itself invalidate the criterion when the criterion is applied 
to select the “best” from some prescribed class of estimates; but “best” 
cannot here be taken literally. 

Criterion 7 is unusual in that it depends on the joint conditional dis- 
tributions of pairs of estimates rather than on the distributions of each 
estimate considered separately. On any ordinary interpretation of es- 
timation known to me, it can be argued (as it was under Criterion 3) 
that no criterion need depend on more than the separate distributions. 


CHAPTER 16 


Testing 


1 Introduction 


In principle, this chapter on the statistical process of testing (often 
referred to more fully as making tests of hypotheses or significance 
tests) might have been organized on the pattern of the preceding chapter 
on point estimation: a statement of verbalistic ideas, followed by 
[...] behavioralistic ideas. But I am [...] several considerations. It 
[...] behavioral [...] their counterparts in 
estimation. Finally, it is inappropriate to attempt anything like a 
complete list of verbalistic criteria for tests here, especially in view of 
the availability of two complementary key references ([...] 26, and 27 
of [K2]; and [L4]). 

[...] from a frankly behavioralistic [...]. In 
this discussion ideas [...] tradition are used 
freely, and some criteria [...] of today. 
Terms introduced in boldface in this chapter are among the most 
frequent in ordinary statistical usage. The definitions given are 
intended to be in reasonable accord with that usage, but some small 
concessions are made to the particular form in which the theory of testing 
is expressed here. 


2 A theory of testing 

Verbalistically, the problem of testing means to guess, on the basis 
of observation, which of two disjoint and mutually exhaustive hypoth- 
eses obtains. Behavioralistically, this would generally be agreed to 
point to the definition: A testing problem is an observational decision 
problem derived from exactly two basic acts f_0 and f_1. These two basic 
acts are called (for a reason that will soon be clear) accepting and re- 
jecting the null hypothesis, respectively. 

Considered abstractly as bilinear games, testing problems may, so 
far as I know, have no special feature beyond the uninteresting one 
that one of two f's is appropriate to each i. But, considered as obser- 
vational problems, testing problems do present some interesting special 
features. In the first place, since at least one of the two basic acts is 
appropriate to each i, the set I of all i's can be partitioned into three 
sets, H_0, H_1, and N, defined thus: 

      L(f_0; i) = 0  and  L(f_1; i) > 0   for i \in H_0,
(1)   L(f_0; i) > 0  and  L(f_1; i) = 0   for i \in H_1,
      L(f_0; i) = 0  and  L(f_1; i) = 0   for i \in N.

When it is recalled that the i's correspond to a partition B_i of S, the 
sets H_0, H_1, and N may, with a slight clash of logical gears, be regarded 
as three events partitioning S. The traditional names of H_0 and H_1 
are the null and the alternative hypothesis, respectively; N, being quite 
unimportant and often either ignored or made vacuous by some trick 
of definition, has no such name. Rejecting the null hypothesis when it 
does in fact obtain and accepting it when it does not obtain are called 
errors, more specifically errors of the first and second kind, respec- 
tively. 

A test is a derived act of a testing problem. A test may conveniently 
be identified with the real-valued contraction z of the observation x 
such that z(x) is the probability prescribed by the test for rejection of 
the null hypothesis in case x is observed. An unmixed test (which was 
until recently the only kind contemplated) corresponds to a z confined 
to the two values 0 and 1, which respectively imply outright acceptance 
and rejection of the null hypothesis. 

The loss associated with the test z when i obtains is clearly 

(2)   L(z; i) = L(f_0; i) E(1 - z | i) + L(f_1; i) E(z | i)
              = L(f_1; i) E(z | i)            for i \in H_0
              = L(f_0; i) [1 - E(z | i)]      for i \in H_1
              = 0                             for i \in N.


The functions E(z | i) and [1 - E(z | i)] are, respectively, the proba- 
bilities of rejecting and of accepting the null hypothesis with the test z 
[...]. They are commonly called the power function, and operating 
characteristic, respectively. 

In view of (2), one test z dominates another z', if and only if 

(3)   E(z | i) \le E(z' | i)   for i \in H_0,
      E(z | i) \ge E(z' | i)   for i \in H_1;

or, again, if and only if the probability of error with z' is at least as 
great as with z for every i. 
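The quantities in (2) and (3) can be computed directly when i and x take finitely many values. The following is a minimal sketch under that assumption; the function names and the two-state, three-observation problem are hypothetical illustrations, not part of the text.

```python
from fractions import Fraction as F

def power(z, P, i):
    """E(z | i): probability that the test z rejects when i obtains."""
    return sum(P[i][x] * z[x] for x in range(len(z)))

def loss(z, P, L0, L1, i):
    """L(z; i) = L(f_0; i) E(1 - z | i) + L(f_1; i) E(z | i), as in (2)."""
    e = power(z, P, i)
    return L0[i] * (1 - e) + L1[i] * e

def dominates(z, z_other, P, H0, H1):
    """Condition (3): no more rejection on H_0, no less rejection on H_1."""
    return (all(power(z, P, i) <= power(z_other, P, i) for i in H0)
            and all(power(z, P, i) >= power(z_other, P, i) for i in H1))

# Hypothetical problem: i = 0 is the null hypothesis, i = 1 the alternative.
P = {0: [F(1, 2), F(1, 3), F(1, 6)],   # P(x | 0) for x = 0, 1, 2
     1: [F(1, 6), F(1, 3), F(1, 2)]}   # P(x | 1) for x = 0, 1, 2
L0 = {0: F(0), 1: F(1)}                # L(f_0; i): accepting costs only on H_1
L1 = {0: F(1), 1: F(0)}                # L(f_1; i): rejecting costs only on H_0
z_good = [0, 0, 1]                     # unmixed test rejecting only when x = 2
z_bad = [1, 0, 0]                      # unmixed test rejecting only when x = 0

print(power(z_good, P, 0), power(z_good, P, 1))
print(loss(z_good, P, L0, L1, 0), loss(z_good, P, L0, L1, 1))
print(dominates(z_good, z_bad, P, [0], [1]))
```

Exact fractions are used so that the comparisons in (3) are not disturbed by rounding.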


      L(f_0; i) = [1 - E(z | i)] [...]   for i \in H_1,
                = 0                      elsewhere;
(4)
      L(f_1; i) = E(z | i) [...]         for i \in H_0,
                = 0                      elsewhere,
and the minimax rule lead to no criteria expressible solely in terms of 
H_0, H_1, and the conditional distributions of the observation x other 
than that of admissibility itself. Whether some other objectivistic prin- 
ciple could justify such criteria may be considered an open question, 
but, as I have already said (in § 15.1), no other general objectivistic 
principles have been seriously maintained. 

It is natural, for example, to demand that z have the same symmetry 
as P(x | i) and H_0 and H_1; but that criterion can surely not be justified 
at all, unless the basic loss is also assumed to have the same symmetry, 
the justifiability of which in turn depends on the case. 

To take another important example, it is often proposed that a satis- 
factory test must be unbiased, that is, its power function must never 
be higher in H_0 than in H_1. More formally, the test z is unbiased, if 
and only if 

(5)   E(z | i_0) \le E(z | i_1)

for every i_0 ∈ H_0 and every i_1 ∈ H_1. 

Assuming that L(f_0; i) and L(f_1; i) are constant in H_1 and H_0, re- 
spectively, it will be shown that any minimax test must be unbiased. As a 
step toward that demonstration, consider a testing problem as a mini- 
max problem, without any special assumption about the basic loss. 
It is possible that L* = 0, in which case the minimax tests are all equiv- 
alent and all unbiased. Putting that possibility aside, I assert, and will 
show, that (under the usual mathematical simplifications) 

(6)   \max_{i \in H_0} L(z; i) = \max_{i \in H_1} L(z; i) = L*

for any minimax z. It is obvious that neither maximum exceeds L*, 
and also that one or the other must equal L*. But suppose, for exam- 
ple, that the second maximum were actually less than L*, and consider 
z' = αz with 0 < α < 1. According to (2), if z' is substituted for z, 
the first maximum in (6) will be depressed, and, for α sufficiently close 
to 1, the second would remain actually less than L*, which contradicts 
the assumption that z is minimax, establishing (6). 

Now make the special assumption that 

(7)   L(f_0; i) = A   for i \in H_1,
      L(f_1; i) = B   for i \in H_0,

and suppose that z could be minimax but biased. There would then 
exist i_0 ∈ H_0 and i_1 ∈ H_1 such that 

(8)   L* = L(z; i_0) = B E(z | i_0) = A - A E(z | i_1) = L(z; i_1),

and such that E(z | i_0) > E(z | i_1). But consideration of the test that 
simply assigns to every x the number β midway between E(z | i_0) and 
E(z | i_1) shows that z could not be minimax. 

† A definition unifying the various concepts of unbiasedness in statistics is put 
forward in [L5]. 
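The last step of the argument can be checked numerically. The sketch below assumes the constant losses of (7) and picks hypothetical values of A, B, and of the two rejection probabilities so that the equality (8) holds for a biased test; it then compares that test with the constant test β. The function and variable names are illustrative assumptions.

```python
from fractions import Fraction as F

def losses_at(e0, e1, A, B):
    """Losses B*E(z|i_0) and A*(1 - E(z|i_1)) of a test under assumption (7)."""
    return B * e0, A * (1 - e1)

def constant_test_losses(beta, A, B):
    """A constant test z(x) = beta has E(z | i) = beta for every i,
    so under (7) its loss is B*beta throughout H_0 and A*(1 - beta) throughout H_1."""
    return B * beta, A * (1 - beta)

# Hypothetical numbers chosen so that (8) holds with E(z|i_0) > E(z|i_1).
A, B = F(2), F(1)
e0, e1 = F(4, 5), F(3, 5)          # biased: more rejection at i_0 than at i_1
L_star = max(losses_at(e0, e1, A, B))

beta = (e0 + e1) / 2               # midway between the two rejection probabilities
L_const = max(constant_test_losses(beta, A, B))

print(L_star, L_const)
```

Because β lies strictly below E(z | i_0) and strictly above E(z | i_1), both losses of the constant test fall strictly below L*, which is the contradiction the text appeals to.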


The condition (7) is a reasonable assumption in some testing problems, 
and, where (7) is satisfied, the criterion of unbiasedness has such sup- 
port as the minimax rule can give. In many other typical testing prob- 
lems, however, there are borderline errors that hardly matter at all but 
can scarcely be prevented, and serious errors that can largely be pre- 
vented. The following example, which can be varied to suit diverse 
tastes, shows that it can be folly to insist on unbiasedness in such 
problems. 


Let i take the three values 0, 1, 2, and let x take the values 0 and 1 
with conditional probabilities defined thus: 

(9)   P(0 | 0) = 99/100,   P(0 | 1) = 0,   P(0 | 2) = 1.

Let the basic loss be defined by the condition that i ∈ H_0 or i ∈ H_1, ac- 
cording as i = 0 or not, and by 

(10)  L(f_1; 0) = 1,   L(f_0; 1) = 1,   L(f_0; 2) = 1/101.

Then 

      L(z; 0) = [99 z(0) + z(1)]/100
(11)  L(z; 1) = 1 - z(1)
      L(z; 2) = [1 - z(0)]/101.

It is easily verified that the only minimax z* is defined by z*(0) = 0, 
z*(1) = 100/101, with L(z*; i) = 1/101 for every i. But it [...] 
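The verification can be carried out mechanically. The following is a minimal sketch that evaluates the three losses of (11) over a grid of mixed tests (z(0), z(1)) in steps of 1/101, a grid chosen here only because it contains the minimax test exactly; the names are hypothetical. It also restricts the search to tests that are unbiased in the sense of (5), to show what insisting on unbiasedness costs in this example.

```python
from fractions import Fraction as F

def losses(z0, z1):
    """L(z; i) for i = 0, 1, 2 as in (11), with z0 = z(0) and z1 = z(1)."""
    return ((99 * z0 + z1) / 100, 1 - z1, (1 - z0) / 101)

def powers(z0, z1):
    """E(z | i) for i = 0, 1, 2, from the probabilities in (9)."""
    return ((99 * z0 + z1) / 100, z1, z0)

def is_unbiased(z0, z1):
    """Condition (5): the power at i = 0 (H_0) never exceeds that at i = 1, 2 (H_1)."""
    e0, e1, e2 = powers(z0, z1)
    return e0 <= e1 and e0 <= e2

grid = [F(k, 101) for k in range(102)]

# Smallest worst-case loss over all grid tests, and over unbiased grid tests only.
best = min((max(losses(a, b)), a, b) for a in grid for b in grid)
best_unbiased = min((max(losses(a, b)), a, b)
                    for a in grid for b in grid if is_unbiased(a, b))

print(best)            # (worst-case loss, z(0), z(1)) of the minimax test
print(best_unbiased)   # the same for the best unbiased test on the grid
```

On this grid the minimax test has worst-case loss 1/101 and is biased, since its power at i = 2 is 0 while its power at i = 0 is 1/101. The unbiased tests turn out to be exactly the constant ones, whose worst-case loss cannot fall below 1/2 (51/101 on this grid).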


incorporated into the very definition of a test. Though many impor- 
tant tests happen to have a size, others equally important do not; so 
it now seems to be recognized [L4] that the possession of a size cannot 
be taken seriously as a criterion. To take an everyday example, con- 
sider the binomial distributions 

(12)  P(x | p) = \binom{101}{x} p^x (1 - p)^{101 - x},

where the parameter p confined to [0, 1] plays the role of i and x = 0, 
..., 101; and suppose that H_0 is the hypothesis that p ≤ 1/2. A test 
of size α is a test for which 

(13)  \sum_x z(x) \binom{101}{x} p^x (1 - p)^{101 - x} = \alpha

for all p ≤ 1/2. This obviously implies 

(14)  \sum_x [z(x) - \alpha] \binom{101}{x} \left( \frac{p}{1 - p} \right)^x = 0

for all p ≤ 1/2, whence z(x) = α for every x. So only absurd tests 
have size, in this example, though there are clearly tests here that are 
quite satisfactory for many applications, for example, let z(x) equal 0 
or 1 according as x ≤ 50 or x > 50. 
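The point can be illustrated by evaluating the left side of (13) exactly for the cutoff test just mentioned. This is a minimal sketch with hypothetical function names; it assumes 101 binomial trials, as the range x = 0, ..., 101 indicates.

```python
from fractions import Fraction as F
from math import comb

N = 101  # number of binomial trials, so that x = 0, ..., 101

def power(z, p):
    """E(z | p) = sum over x of z(x) C(101, x) p^x (1 - p)^(101 - x)."""
    return sum(z(x) * comb(N, x) * p**x * (1 - p)**(N - x) for x in range(N + 1))

def z_cut(x):
    """Unmixed test: reject H_0 exactly when x > 50."""
    return 1 if x > 50 else 0

def z_const(x):
    """Constant test z(x) = 1/20, the only kind satisfying (13)."""
    return F(1, 20)

for p in (F(1, 4), F(2, 5), F(1, 2)):
    print(p, float(power(z_cut, p)), power(z_const, p))
```

The rejection probability of the cutoff test changes with p inside H_0, so it has no size in the sense of (13), whereas the constant test rejects with probability 1/20 whatever p may be.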

In view of the criticism just made, there is a tendency to redefine 
size so that any test has a size α, namely, 

(15)  \alpha =_{Df} \max_{i \in H_0} E(z | i).

In terms of this definition of size, a concept of testing somewhat differ- 
ent from that proposed in this section has been defined and defended 
(Wald, p. 21 of [W3], and Lehmann, pp. 17-18 of [L4]); namely, it is 
postulated that a test is to be chosen not from among all possible tests, 
but only from among those having a size α (in the sense of (15)) given 
as part of the testing problem.‡ This concept of testing is not defended 
to the exclusion of the one proposed here, but it is asserted by the 
authors cited to be more realistic for some problems. The arguments of 
both authors on this point are similar and, I think, quite weak in two 
crucial places, for the advantage is supposed to flow in some unspeci- 
fied way from the undemonstrated impossibility of comparing prefer- 
ences for consequences of qualitatively different kinds. It seems, if I 
may be allowed such a conjecture, that the concept of testing under a 
constraint of size represents a Procrustean attempt to fit the (older) 
Neyman-Pearson theory of testing hypotheses too closely with the 
(newer) minimax theory. It is not to be denied, of course, that there 
may sometimes be a mathematical advantage in studying and compar- 
ing tests of given size. 

† Statisticians interested in the Behrens-Fisher problem may be interested in pp. 
35.173a-b of [F6], which hinge on the question of size as a criterion. 

‡ The constraint actually imposed, especially by Lehmann [L4], is that the size 
be at most α. But, as Lehmann explains, this difference is more apparent than real. 

It should be mentioned, before concluding the subject, that any the- 
ory taking size seriously introduces an asymmetry of the theory with 
respect to H_0 and H_1, an asymmetry that is surely not always appropri- 
ate. 

Significance level, or level of significance, is a synonym (neglecting 
a slight distinction made in [L4]) of size, probably more widely used 
than size itself. 


3 Testing in practice 


The theory of testing admits some fairly realistic applications, but 
the present state of statistics is such that the theory of testing is in- 
voked more often than not in problems on which it does not bear 
squarely. This section discusses typical applications of the theory, 
pointing out the shortcomings I am aware of. 

The development of the theory of testing has been much influenced 
by the special problem of simple dichotomy, that is, testing problems 
in which H_0 and H_1 have exactly one element each. Simple dichotomy 
is susceptible of neat and full analysis (as in Exercise 7.5.2 and in 
§ 14.4), likelihood-ratio tests here being the only admissible tests; and 
simple dichotomy often gives insight into more complicated problems, 
though the point is not explicitly illustrated in this book. 

Coin and ball examples of simple dichotomy are easy to construct, 
but instances seem rare in real life. The astronomical observations 
made to distinguish between the Newtonian and Einsteinian hypotheses 
are a good, but not perfect, example, and I suppose that research in Men- 
delian genetics sometimes leads to others. There is, however, a tradi- 
tion of applying the concept of simple dichotomy to some situations to 
which it is, to say the best, only crudely adapted. Consider, for ex- 
ample, the decision problem of a person who must buy, f_0, or refuse to 
buy, f_1, a lot of manufactured articles on the basis of an observation x. 
Suppose that i is the difference between the value of the lot to the per- 
son and the price at which the lot is offered for sale, and that P(x | i) is 
known to the person. Clearly, H_0, H_1, and N are sets characterized 
respectively by i > 0, i < 0, i = 0. The analysis of this, and similar, 
problems has recently been explored in terms of the minimax rule, for 
example by Sprowls [S16] and a little more fully by Rudy [R4], and by 
Allen [A3]. It seems to me natural and promising for many fields of 
application, but it is not a traditional analysis. On the contrary, much 
literature recommends, in effect, that the person pretend that only two 
values of i, i_0 > 0 and i_1 < 0, are possible and that the person then 
choose a test for the resulting simple dichotomy. The selection of the 
two values i_0 and i_1 is left to the person, though they are sometimes 
supposed to correspond to the person’s judgment of what constitutes 
good quality and poor quality—terms really quite without definition. 
The emphasis on simple dichotomy is tempered in some acceptance- 
sampling literature, where it is recommended that the person choose 
among available tests by some largely unspecified overall consideration 
of operating characteristics and costs, and that he facilitate his survey 
of the available tests by focusing on a pair of points that happen to in- 
terest him and considering the test whose operating characteristic 
passes (economically, in the case of sequential testing) through the 
pair of points. These traditional analyses are certainly inferior in the 
theoretical framework of the present discussion, and I think they will 
be found inferior in practice. 

To make a small digression, there is a complication in connection with 
testing whether to buy that is not ordinarily envisaged by statistical 
theory; namely, the economic reaction between the buyer and the sup- 
plier. If, for example, the supplier knows the test the buyer is going 
to apply, that knowledge will influence the quality of the lot supplied. 
There seems to be little, if any, successful work on the economic prob- 
lem thus raised about the game-like behavior of the two people involved 
(cf. pp. 331, 340, and 346 of [W6]). 

The problem whether to buy a lot obviously has many formal coun- 
terparts in other domains. In some of them it is particularly clear that 
purely objectivistic methods do not suffice. To illustrate, imagine two 
experiments: one designed to determine whether it is advantageous to 
add a certain small amount of sodium fluoride to the drinking water of 
children, the other to determine whether the same amount of oil of 
peppermint is advantageous. Granting that each of the two additions 
can be made at the same cash cost for labor and material and that the 
designs of the two hypothetical experiments differ only in the inter- 
change of the roles of sodium fluoride and oil of peppermint, the corre- 
sponding testing problems are objectivistically completely parallel, that 
is, the same with regard to loss function and conditional probability of 
the observations. But it must be acknowledged, I think, that the people 
actually charged with the decision in either of these two cases would 
and should take into account opinions they had before the observation. 
For example, they might originally have considered it nearly impossible 
that the oil of peppermint could result in any hygienic advantage large 
enough to compensate for even the small cost of its administration, but, 
in view of recent dental researches on the subject, they might not have 
considered it at all unlikely that the sodium fluoride should have an 
overall advantage. In that case, parallel observations in the two ex- 
periments would not always lead to parallel decisions. Objectivists 
typically admit such a possibility but go on to say that it is unreasonable 
to isolate the experiment and that it is the totality of information bear- 
ing on the subject that should be treated objectivistically. If objectiv- 
ists could give a more detailed discussion of how to deal with such a 
totality of information, it might do much to clarify their position. 

I turn now to a different and, at least for me, delicate topic in connec- 
tion with applications of the theory of testing. Much attention is given 
in the literature of statistics to what purport to be tests of hypotheses, 
in which the null hypothesis is such that it would not really be accepted 
by anyone. The following three propositions, though playful in con- 
tent, are typical in form of these extreme null hypotheses, as I shall call 
them for the moment. 


A  The mean noise output of the cereal Krakl is a linear function of 
the atmospheric pressure, in the range from 900 to 1,100 millibars. 

B  The basal metabolic consumption of sperm whales is normally 
distributed [W11]. 

C  New York taxi drivers of Irish, Jewish, and Scandinavian extrac- 
tion are equally proficient in abusive language. 
Literally to test such hypotheses as these is preposterous. If, for ex- 
ample, the loss associated with f_1 is zero, except in case Hypothesis A 
is exactly satisfied, what possible experience with Krakl could dissuade 
you from adopting f_1? 

[...] hypotheses is perfectly well [...] maxim that science dis[...] 
The role of extreme hypotheses 
in science and other statistical activities seems to be important but ob- 
scure. In particular, though I, like everyone who practices statistics, 
have often “tested” extreme hypotheses, [...]tory analysis of the process, 
nor say [...] as defined in this chapter and other theoretical discussions. 
None the less, it seems worth while to explore the subject tentatively; 
I will do so largely in terms of two examples. 

Consider first the problem of a cereal dynamicist who must estimate 
the noise output of Krakl at each of ten atmospheric pressures between 
900 and 1,100 millibars. It may well be that he can properly regard the 
problem as that of estimating the ten parameters in question, in which 
case there is no question of testing. But suppose, for example, that 
one or both of the following considerations apply. First, the engineer 
and his colleagues may attach considerable personal probability to the 
possibility that A is very nearly satisfied—very nearly, that is, in terms 
of the dispersion of his measurements. Second, the administrative, 
computational, and other incidental costs of using ten individual esti- 
mates might be considerably greater than that of using a linear formula. 
It might be impractical to deal with either of these considerations very 
rigorously. One rough attack is for the engineer first to examine the 
observed data x and then to proceed either as though he actually be- 
lieved Hypothesis A or else in some other way. The other way might be 
to make the estimate according to the objectivistic formulae that would 
have been used had there been no complicating considerations, or it 
might take into account different but related complicating considera- 
tions not explicitly mentioned here, such as the advantage of using a 
quadratic approximation. It is artificial and inadequate to regard this 
decision between one class of basic acts or another as a test, but that 
is what in current practice we seem to do. The choice of which test 
to adopt in such a context is at least partly motivated by the vague 
idea that the test should readily accept, that is, result in acting as though 
the extreme null hypotheses were true, in the farfetched case that the 
null hypothesis is indeed true, and that the worse the approximation of 
the null hypotheses to the truth the less probable should be the ac- 
ceptance. 

The method just outlined is crude, to say the best. It is often modi- 
fied in accordance with common sense, especially so far as the second 
consideration is concerned. Thus, if the measurements are sufficiently 
precise, no ordinary test might accept the null hypotheses, for the ex- 
periment will lead to a clear and sure idea of just what the departures 
from the null hypotheses actually are. But, if the engineer considers 
those departures unimportant for the context at hand, he will justifiably 
decide to neglect them. 

Rejection of an extreme null hypothesis, in the sense of the foregoing 
discussion, typically gives rise to a complicated subsidiary decision 
problem. Some aspects of this situation have recently been explored, 
for example by Paulson [P3], [P4]; Duncan [D11], [D12]; Tukey [T4], 
[T5]; Scheffé [S7]; and W. D. Fisher [F7]. 

To summarize abstractly, I would say that, in current practice, so- 
called tests of extreme hypotheses are resorted to when at least a little 
credence is attached to the possibility that the null hypothesis is very 
nearly true and when there is some special advantage to behaving as 
though it were true. One other illustration will make it clear that point 
estimation is not essential to the situation and that belief in the approxi- 
mate truth of the null hypothesis alone does not always justify testing. 

Consider the personnel manager of a great New York taxi company. 
Wishing, of course, that his drivers should be as proficient as possible, 
he would, under simple circumstances, hire exclusively from the na- 
tional-extraction group that had obtained the highest mean scores in a 
standard proficiency examination; for why should he not be guided by 
a positive indication, however slight? A statistical test of the extreme 
Hypothesis C would not, therefore, be called for, as has been pointed 
out in general terms by Bahadur and Robbins [B3]. Even strong be- 
lief that ethnic differences are extremely small in the respect in question 
would not alone be any reason for departing from this simple policy, 
dictated by the principle of admissibility, quite in contrast to the ex- 
ample framed around Hypothesis A. If, however, public opinion, a 
shortage of labor, or administrative difficulty militates against any dis- 
crimination at all, the manager may resort to a test based on the ex- 
amination scores. 

In practice, tests of extreme hypotheses are typically chosen from a 
relatively small arsenal of standard types, or families, each family con- 
sisting of one unmixed test at every significance level (as size is always 
called in this context). [...] it is standard practice not [...] 

[...]ory of extreme hypotheses is [...] context of the two-sided t-test. 

CHAPTER 17 


Interval Estimation 


and Related Topics 


1 Estimates of the accuracy of estimates 


The doctrine is often expressed that a point estimate is of little, or 
no, value unless accompanied by an estimate of its own accuracy. This 
doctrine, which for the moment I will call the doctrine of accuracy esti- 
mation, may be a little old-fashioned, but I think some critical discus- 
sion of it here is in order for two reasons. In the first place, the doctrine 
is still widely considered to contain more than a grain of truth. For 
example, many readers will think it strange, and even remiss, that I 
have written a long chapter (Chapter 15) on estimation without even 
suggesting that an estimate should be accompanied by an estimate of 
its accuracy. In the second place, it seems to me that the concept of 
interval estimation, which is the subject of the next section, has largely 
evolved from the doctrine of accuracy estimation and that discussion 
of the doctrine will, for some, pave the way for discussion of interval 
estimation. 

The doctrine of accuracy estimation is vague, even by the standards 
of the verbalistic tradition, for it does not say what should be taken 
as a measure of accuracy, that is, what an estimate of accuracy should 
estimate. Any measure would be rather arbitrary; a typical one, here 
adopted for definiteness, is the root-mean-square error, 


(1)   E^{1/2}((l - \lambda)^2 | B_i) = \{ V(l | B_i) + [E(l | B_i) - \lambda]^2 \}^{1/2},


using (15.6.23). The root-mean-square error reduces to the standard 
deviation, V^{1/2}(l | B_i), in case the estimate l is unbiased. 
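The decomposition of the mean-square error into variance plus squared bias can be checked on any small distribution. The following is a minimal sketch; the three-point distribution of the estimate and the true value of the parameter are hypothetical numbers chosen only for illustration.

```python
from fractions import Fraction as F

def mean(dist):
    """Expected value of an estimate given as (value, probability) pairs."""
    return sum(p * v for v, p in dist)

def variance(dist):
    """Variance of the estimate about its own mean."""
    m = mean(dist)
    return sum(p * (v - m) ** 2 for v, p in dist)

def mean_square_error(dist, lam):
    """Expected squared distance of the estimate from the true value lam."""
    return sum(p * (v - lam) ** 2 for v, p in dist)

# Hypothetical conditional distribution of an estimate l, and true parameter value.
dist = [(F(0), F(1, 4)), (F(1), F(1, 2)), (F(3), F(1, 4))]
lam = F(1)

bias = mean(dist) - lam
print(mean_square_error(dist, lam), variance(dist) + bias ** 2)
```

The root-mean-square error is the square root of either printed quantity; when the bias is zero it coincides with the standard deviation.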

Taking the doctrine literally, it evidently leads to endless regression, 
for an estimate of the accuracy of an estimate should presumably be 
accompanied by an estimate of its own accuracy, and so on forever. 

Even supposing that the doctrine were somehow purged of vagueness 
and endless regression, it would still be in clear conflict with the be- 
havioralistic concept of estimation studied in Chapter 15. If a decision 
problem consists in deciding on a number in the light of an observation, 
the person concerned wants to adopt an l that is, in some sense or 
other, as good as possible; but, since he must make some decision, it 
could at most satisfy idle curiosity to know how good the best is— 
idle, I say, because, his decision once made, there is no way to use knowl- 
edge of its accuracy. 

Since it seems to me that the kind of problem envisaged in Chapter 
15 is of frequent occurrence and may properly be called estimation, 
I am inclined to say that the doctrine of accuracy estimation is errone- 
ous. However, it is possible that someone should point out a different 
class of problems, also properly called problems of estimation, with re- 
spect to which the doctrine has some validity; though, so far as I know, 
this has not yet occurred. 

One sort of situation that might, through what I would consider 
faulty analysis, seem to support the doctrine of accuracy estimation is 
illustrated by the following, highly schematized example. A person 
has to estimate the number n of replacement parts of a certain sort 
that should be carried by an expedition. He can conduct a trial the 
outcome of which will, let us say, be an observation x distributed in 
the Poisson distribution with mean equal to αcn; that is, 

(2)   P(x | n) = e^{-\alpha c n} (\alpha c n)^x / x!,

where α is a known constant and c, which the person can choose, is the 
cost (beyond overhead) of the trial. Under reasonable hypotheses, 
once c has been chosen and the value x observed, n(x) = x/αc is a good 
estimate of n; and in so far as the problem is of the type envisaged in 
Chapter 15, that is the end of the matter. 

But there may be features of the problem that have not yet been 
stated, though in principle they should have been. [...] 
may be that the person is free to co[...] 
[...]ong so. One rough, but sometimes 
natural and practical, step toward [...] 
called for is to remark that (n/αc)^{1/2} [...] 
square error of n and may give a fairly good basis on which to judge 
whether the risk of misestimation warrants the expense of a second 
trial. 

My own conviction is that we should frankly regard such a problem 
as has just been described as a special problem in sequential analysis 
and treat it as an organic whole. Viewed thus, c is to be chosen in the 
light of the possibility of making a second trial. The decision to be 
based on x is the complex one of whether to go to the expense of a second 
trial; if so, of what magnitude; and, if not, what estimate of n to adopt. 
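The rough step described above can be sketched numerically. Since x is Poisson with mean αcn, the estimate x/αc has mean n and variance n/αc, so its root-mean-square error is (n/αc)^{1/2}. The sketch below assumes that, n being unknown, the estimate x/αc is substituted for it; that plug-in reading, the function names, and the numbers are assumptions made for illustration.

```python
import math

def estimate_n(x, alpha, c):
    """Point estimate n(x) = x / (alpha * c) from one Poisson observation x."""
    return x / (alpha * c)

def true_rms_error(n, alpha, c):
    """Root-mean-square error of the estimate when n is the true value:
    Var(x) = alpha*c*n, hence Var(x / (alpha*c)) = n / (alpha*c)."""
    return math.sqrt(n / (alpha * c))

def estimated_rms_error(x, alpha, c):
    """Plug-in version: the unknown n is replaced by its estimate (an assumption)."""
    return math.sqrt(estimate_n(x, alpha, c) / (alpha * c))

# Hypothetical trial: alpha = 0.5, cost c = 4, observed count x = 36.
alpha, c, x = 0.5, 4.0, 36
print(estimate_n(x, alpha, c), estimated_rms_error(x, alpha, c))

# Quadrupling the cost of the trial halves the root-mean-square error for a given n.
print(true_rms_error(18, alpha, c), true_rms_error(18, alpha, 4 * c))
```

A person weighing a second trial could compare the estimated error with the loss from carrying too many or too few parts, though, as the text argues, the sounder course is to treat both trials as one sequential problem.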
Another sort of situation that seems to have stimulated the doctrine 
of accuracy estimation is the following. Suppose that a research worker 
has observed x_1, ..., x_n, which are independent and normally distributed 
about the mean μ with variance σ^2 given μ and σ. If he wishes to pub- 
lish the results of his investigation for all concerned to use as their own 
needs and opinions may dictate, he should, ideally, publish a sufficient 
statistic of his observation, stating how it is distributed given μ and σ. 
Any other course may deprive some reader of some information he 
might be able to put to use. So far as the primary aim is concerned, all 
sufficient statistics are equivalent, but secondary considerations greatly 
narrow the research worker’s choice. To illustrate, consider the five 
sufficient statistics the values of which for {x_1, ..., x_n} are: 

(a) {x_1, ..., x_n}. 

(b) The n order statistics of {x_1, ..., x_n}. 

(c) Σ x_i and Σ x_i^2. 

(d) x̄ =_Df Σ x_i / n and s^2 =_Df (Σ x_i^2 - x̄ Σ x_i)/(n - 1). 

(e) x̄ and s/n^{1/2}. 

If n is at all large, (c), (d), and (e) are cheaper to publish than (a) 
and (b). Moreover, for almost any use to which a reader might wish 
to put the data, (c), (d), and (e) will save him a considerable amount 
of computation. In so far as it is true that almost any reader who has 
a use for the data at all will use x̄, but not necessarily Σ x_i, statistics 
like (d) and (e) are slightly preferable to (c). There is something to be 
said both for (d) and for (e), in view of the ready availability of certain 
tables; but, at least when n is very large, there is a slight advantage to 
(e) for those calculations a reader is most likely to perform. In par- 
ticular, a reader using (e) can, when n is large, often ignore the actual 
value of n. Even if the distributions of the x_1, ..., x_n are not exactly 
normal, (c), (d), and (e) often can play almost the same role as suffi- 
cient statistics. It is no wonder then that (e) is often chosen as a con- 
venient way to present data. But, in my opinion, it is a mistake to 
lay great theoretical emphasis on the fact that (e) happens to consist 
of what is ordinarily a good estimate of μ, namely x̄, together with what 
is ordinarily a good estimate of the root-mean-square error of that es- 
timate, namely s/n^{1/2}. 
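The equivalence of the summaries (c), (d), and (e) is easy to see in computation, since each can be obtained from the two sums in (c) together with n. The following is a minimal sketch; the function names and the sample are hypothetical.

```python
import math
import statistics

def stats_c(xs):
    """Summary (c): the sum of the observations and the sum of their squares."""
    return sum(xs), sum(x * x for x in xs)

def stats_d(n, sum_x, sum_x2):
    """Summary (d): the sample mean and the variance estimate s^2, from (c) and n."""
    xbar = sum_x / n
    s2 = (sum_x2 - xbar * sum_x) / (n - 1)
    return xbar, s2

def stats_e(n, sum_x, sum_x2):
    """Summary (e): the sample mean and s / n^(1/2), from (c) and n."""
    xbar, s2 = stats_d(n, sum_x, sum_x2)
    return xbar, math.sqrt(s2 / n)

xs = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations
n = len(xs)
sum_x, sum_x2 = stats_c(xs)

print(stats_d(n, sum_x, sum_x2))
print(stats_e(n, sum_x, sum_x2))

# Cross-check against the standard library, which works from the raw data (a).
print(statistics.mean(xs), statistics.variance(xs))
```

Note that recovering (c) or (d) from (e) requires n as well, which is why a reader using (e) alone must sometimes be told n.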
2 Interval estimation and confidence intervals 


The verbalistic tradition has suggested a procedure different from 
point estimation but somehow related to it. This other procedure, here 
called interval estimation, can be defined as follows, though the defini- 
tion is necessarily vague. Where x is an observation subject to the 
conditional distributions P(x | B_i) and λ(i) is a function of i, guess 
that λ(i) lies in some set M(x) (to be called an interval estimate) de- 
termined for each value of x. It is almost a part of the definition to 
say that the function M(x) is to be so chosen that P(λ(i) ∈ M(x) | B_i) 
shall be nearly 1 for every i and that M(x) should tend to be small and 
“close knit” in a geometrical sense, some compromise being effected be- 
tween these two conflicting desiderata. The parameter λ(i) could in 
principle be a very general function, but it will here be enough to sup- 
pose for definiteness and simplicity that λ(i) is real. Though more 
general possibilities are contemplated in principle, the set M(x) is in 
practice typically a bounded interval, which corresponds with what I 
meant in saying that M(x) is supposed to be “close knit.” 

The idea of interval estimation is [...] 

(1)   P(\lambda(i) \in M(x) | B_i) = \alpha,

where α is constant and almost equal to 0.95. 
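A familiar instance may make the requirement concrete. The sketch below assumes, purely for illustration, a single observation x normally distributed about an unknown mean with known standard deviation 1, and the interval M(x) = [x - 1.96, x + 1.96]; the function names are hypothetical.

```python
from statistics import NormalDist

def interval(x, half_width=1.96):
    """Interval estimate M(x) centered on the observation."""
    return (x - half_width, x + half_width)

def coverage(mu, half_width=1.96, sigma=1.0):
    """Probability, given the true mean mu, that M(x) contains mu,
    i.e. that x falls within half_width of mu."""
    d = NormalDist(mu, sigma)
    return d.cdf(mu + half_width) - d.cdf(mu - half_width)

for mu in (-3.0, 0.0, 10.0):
    print(mu, round(coverage(mu), 4))

print(interval(2.5))
```

The probability refers to the random interval M(x) for a fixed parameter value, and it is the same, about 0.95, whatever that value may be.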

It is usually thought necessary to warn [...] 
[...]tion as (1) does not concern the probability [...] 
lies in a fixed set M(x). Of course, [...] 
in the context at hand; and, given [...] 
which is a contraction of x [...] 

[...] need not be altogether rejected, but that interval estimation satisfies 
a parallel need. 

The first part of the explanation just cited is specious, since no one 
really expects a point estimate to be correct, and since, when one really 
is obliged by circumstances to make a point estimate in the behavioral- 
istic sense, there is no escaping it. None the less, that part of the ex- 
planation does seem to give some insight into the appeal of interval es- 
timation. The second part of the explanation is a sort of fiction; for it 
will be found that whenever its advocates talk of making assertions that 
have high probability, whether in connection with testing or estima- 
tion, they do not actually make such assertions themselves, but end- 
lessly pass the buck, saying in effect, “This assertion has arisen accord- 
ing to a system that will seldom lead you to make false assertions, if 
you adopt it. As for myself, I assert nothing but the properties of the 
system.” 

From the behavioralistic point of view, I maintain that point estima- 
tion fulfils an important function. On the other hand, I can cite no 
important behavioralistic interpretation of interval estimation. More- 
over, in such direct and indirect contact as I have had with actual sta- 
tistical practice, I have—with but one extraordinary exception, which 
will soon be discussed—encountered no applications of interval estima- 
tion that seemed convincing to me as anything more than an informal 
device for exploring data or crudely summarizing it for others. In 
short, not being convinced myself, I am in no position to present con- 
vincing evidence for the usefulness of interval estimation as a direct 
step in decision. The reader should know, however, that few are as 
pessimistic as I am about interval estimation and that most leaders in 
statistical theory have a long-standing enthusiasm for the idea, which 
may have more solid grounds than I now know. 

The following is a schematized example of one sort of decision problem
that does call for something like interval estimation. An observation
x bears on the position $\lambda$ of a lifeboat, the occupants of which will
be saved or lost, according as the boat is or is not sighted by a searching
aircraft before nightfall. The decision problem is, therefore, to
choose, from all the domains that the airplane could search in time, one
domain $M(x)$; and the loss must, in effect, be reckoned as 0 or 1 according
as $M(x)$ does or does not contain $\lambda$. This type of problem seems,
however, too rare and too special to be taken as representative of those 
for which interval estimation is so widely advocated. 

Many criteria have been put forward for interval estimation, but I 
am of course in no position to discuss them critically. J. Neyman has 
gone about the search for criteria systematically, setting up a parallel- 
ism between the theory of interval estimation and of testing. In par- 
ticular, paralleling the criterion of fixed size for tests, he has emphasized 
interval estimates such that 


(2)    $P(\lambda(i) \in M(x) \mid B_i) = \alpha$


for a fixed $\alpha$ (typically close to 1) and for every i. Such interval estimates
are called confidence intervals at the confidence level $\alpha$. The
interval estimate mentioned in connection with (1) is obviously a confidence
interval. Wald [W3] sought to include the theory of confidence
intervals in the minimax theory, but in my opinion he did not succeed
in giving interval estimation a behavioralistic interpretation.
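
The defining property (2) can be checked numerically in a simple case. The sketch below assumes, purely for illustration, that x is normally distributed about $\lambda$ with unit variance and that $M(x) = [x - 1.96, x + 1.96]$; the function names, the seed, and the sample size are hypothetical choices. It computes the coverage probability exactly and then estimates it by simulation for several values of $\lambda$, to show that it does not depend on $\lambda$.

```python
import math
import random


def normal_cdf(z: float) -> float:
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def exact_coverage(half_width: float) -> float:
    """P(lam in [x - h, x + h] | lam) when x ~ N(lam, 1); it is free of lam."""
    return normal_cdf(half_width) - normal_cdf(-half_width)


def simulated_coverage(lam: float, half_width: float, n: int, seed: int) -> float:
    """Fraction of n simulated intervals [x - h, x + h] that contain lam."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x = rng.gauss(lam, 1.0)  # one observation, assumed N(lam, 1)
        if x - half_width <= lam <= x + half_width:
            hits += 1
    return hits / n


if __name__ == "__main__":
    h = 1.96
    print("exact coverage:", exact_coverage(h))
    for lam in (-3.0, 0.0, 10.0):
        print("lam =", lam, "simulated:", simulated_coverage(lam, h, 20000, 1))
```

The simulated frequencies describe the long-run behavior of the rule that produces the intervals; they say nothing about any one interval already computed, which is the distinction the text draws.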

Though I am in no position to criticize any criterion of interval es- 
timation, I venture to ask whether (2) is not gratuitous, as I have more 
positively asserted of its analogue in the theory of testing. 


Chapters 19 and 20 of [K2] will serve as key references for interval 
estimation. 


3 Tolerance intervals 


There has recently been considerable study of what are called tolerance
intervals (or limits). They are related to the problem of guessing
the actual value of a real random variable y, on the basis of an observation
of x. A tolerance interval for y at tolerance level $\alpha$ and confidence
level $\beta$ is an interval-valued function $Y(x)$ such that


(1)    $P[P(y \in Y(x) \mid B_i, x) \ge \alpha \mid B_i] = \beta$

for every i.

The concept expressed by (1) is a slippery one; perhaps it will help
to express it in words thus: For every $B_i$, there is probability $\beta$ that x is
such that y will fall in $Y(x)$ with probability at least $\alpha$, given $B_i$ and
x. In typical applications y is independent of x; this permits a slight
simplification of the definition. The notion of tolerance interval seems
to me at least as unamenable to behavioralistic interpretation as that
of confidence interval, and I therefore venture no discussion of it here.
Key references are [B22] and [W7].
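
To make the nested probabilities in (1) concrete, the following sketch assumes, for illustration only, that x and y are independent and normally distributed about a common unknown mean with unit variance, and that $Y(x) = [x - c, x + c]$. Under those assumptions the inner probability depends only on the offset d = x − mean, so the confidence level $\beta$ attained at a given tolerance level $\alpha$ can be computed by bisection. All names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def content(d, c):
    """Inner probability P(y in [x - c, x + c] | x), where d = x - mean."""
    return normal_cdf(d + c) - normal_cdf(d - c)


def critical_offset(c, alpha):
    """Largest d >= 0 with content(d, c) >= alpha, or None if there is none.

    content(d, c) decreases as |d| grows, so bisection on [0, c + 10] works.
    """
    if content(0.0, c) < alpha:
        return None
    lo, hi = 0.0, c + 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if content(mid, c) >= alpha:
            lo = mid
        else:
            hi = mid
    return lo


def confidence_level(c, alpha):
    """Outer probability beta = P(content(x - mean, c) >= alpha)."""
    d_star = critical_offset(c, alpha)
    if d_star is None:
        return 0.0
    return 2.0 * normal_cdf(d_star) - 1.0  # P(|x - mean| <= d_star)


if __name__ == "__main__":
    for c in (3.0, 4.0):
        print("c =", c, "beta =", confidence_level(c, 0.9))
```

Widening the interval (larger c) raises the attainable confidence level at a fixed tolerance level, which is the trade-off a choice of $Y(x)$ has to settle.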


4 Fiducial probability 


This is not really a section on fid 
apology for not having such a section. The concept of fiducial proba- 
bility put forward and stressed by R. A. Fi 
technical concept of modern statistics, 
concerned with interval estimation, I wanted to discuss it here. I 
have, however, been privileged to see certain as yet unpublished manuscripts
of R. M. Williams [W12] and J. W. Tukey which convince me
that such discussion by me now would be premature.

Some key references to fiducial probability and to the Behrens-Fisher 
problem, which is the most disputed field of application of fiducial 
probability, are Fisher’s own papers, especially [F5], and Papers 22, 
25, 26, 27, and 35 of the collection [F6]; Kendall [K2], Chapter 20; 
Yates [Y1]; Owen [O1]; Segal [S9]; Bartlett [B6]; Scheffé [S6], [S5]; 
Walsh [W9]; and Chand [C5]. 


APPENDIX 1 


Expected Value 


This appendix, a brief account of some relatively elementary aspects 
of the badly named mathematical concept, expected value, is presented 
for those who might otherwise be handicapped in reading this book. 
No proofs are given here, but the reader who needs this appendix will 
probably be willing and able to accept the facts cited without proof, 
especially if he acquires intuition for the subject by working the sug- 
gested exercises. The requisite proofs are, however, given implicitly 
in any standard work on integration or measure (e.g., Chapters I-V of 
[H2]). 

Throughout this appendix, let S be a set with elements s and subsets
A, B, C, ··· on which a (finitely additive) probability measure P is
defined. Bounded real random variables, that is, bounded real-valued
functions, defined for each $s \in S$, will here be denoted by x, y, ···, and
real numbers by x, y, z, and lower-case Greek letters.

The expected value of x, generally written E(x), is characterized as
the one and only function attaching a real number to every bounded
random variable x, subject to the following three conditions for every
x, y, $\rho$, $\sigma$, and B:


(1)    $E(\rho x + \sigma y) = \rho E(x) + \sigma E(y)$.
(2)    $E(x) \ge 0$ whenever $P(x(s) < 0) = 0$.
(3)    $E(c(\ \mid B)) = P(B)$.


In (3), $c(\ \mid B)$ is the characteristic function of B, that is, $c(s \mid B) = 1$,
if $s \in B$, and $c(s \mid B) = 0$, if $s \in {\sim}B$. In mathematical contexts remote
from the topics in this book, the term “characteristic function” has at 
least two other meanings virtually unconnected with the one at hand, 
one in connection with linear operators on function spaces, and another 
in connection with the Fourier analysis of distributions. 

Often the expected value of x is referred to as the integral of x over
S, in which case it is generally written $\int x(s)\, dP(s)$.


Exercises


1. If x takes only a finite number of values, $x_1, \cdots, x_n$, except on a
set of probability zero; then


(4)    $E(x) = \sum_{i=1}^{n} x_i P(x(s) = x_i)$,


that is, the average of the $x_i$'s, each weighted by the probability of its
occurrence.

2. If $P(x(s) < y(s)) = 0$, $E(x) \ge E(y)$; and if, in addition,
$P(x(s) > y(s) + \epsilon) > 0$ for some $\epsilon > 0$, then $E(x) > E(y)$.†

3. If x is a real random variable, $B_i$ a partition, $\rho_i$ and $\sigma_i$ real numbers
such that $\rho_i \le x(s) \le \sigma_i$ for $s \in B_i$, then


(5)    $\sum_i \rho_i P(B_i) \le E(x) \le \sum_i \sigma_i P(B_i)$.

4.    $c(\ \mid A \cap B) = c(\ \mid A)\,c(\ \mid B)$,
      $c(\ \mid {\sim}A) = 1 - c(\ \mid A)$,
      $c(\ \mid A \cup B) = c(\ \mid A) + c(\ \mid B) - c(\ \mid A)\,c(\ \mid B)$.
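
For readers who like to see the definitions at work, here is a small sketch on a hypothetical three-state world with exact rational probabilities. It checks conditions (1) and (3) and the weighted-average formula (4) of Exercise 1. The states, the probabilities, and the helper names are all assumptions made for the example.

```python
from fractions import Fraction

# Hypothetical finite world: each state s carries probability P[s].
P = {"s1": Fraction(1, 2), "s2": Fraction(1, 3), "s3": Fraction(1, 6)}


def prob(event):
    """P(B) for an event B given as a set of states."""
    return sum((P[s] for s in event), Fraction(0))


def expect(x):
    """E(x) for a random variable x given as a dict: state -> value."""
    return sum((Fraction(x[s]) * P[s] for s in P), Fraction(0))


def indicator(event):
    """Characteristic function of B: 1 on B and 0 on its complement."""
    return {s: (1 if s in event else 0) for s in P}


def expect_by_values(x):
    """Formula (4): each distinct value weighted by its probability."""
    total = Fraction(0)
    for v in set(x.values()):
        total += Fraction(v) * prob({s for s in P if x[s] == v})
    return total


x = {"s1": 2, "s2": -1, "s3": 4}
y = {"s1": 0, "s2": 3, "s3": 3}
rho, sigma = Fraction(3), Fraction(-2)
B = {"s1", "s3"}

# The random variable rho*x + sigma*y, built state by state.
combo = {s: rho * x[s] + sigma * y[s] for s in P}

if __name__ == "__main__":
    print("E(x) =", expect(x), " E(y) =", expect(y))
    print("condition (1):", expect(combo) == rho * expect(x) + sigma * expect(y))
    print("condition (3):", expect(indicator(B)) == prob(B))
    print("formula (4):  ", expect_by_values(y) == expect(y))
```

Exact fractions are used so that the equalities hold literally rather than up to rounding.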


As is explained in texts on measure theory, the expected value can
(at least for countably additive measures), and in practice must, be
extended to many unbounded random variables.

Since, provided $P(B) > 0$, the conditional probability, defined by
$P(C \mid B) = P(C \cap B)/P(B)$, is itself a probability measure, the
expectation of x with respect to a conditional probability is a meaningful
concept. This conditional expectation is written $E(x \mid B)$ and read
“the expected value of x given B.”


More exercises 

5. $E(x \mid B) = E(x\,c(\ \mid B))/P(B)$. Hint: It suffices to verify that the
expression on the right satisfies the three conditions parallel to (1-3)
that define $E(x \mid B)$.


6. If $B_i$ is a partition of S, then

(6)    $\sum_i c(s \mid B_i) = 1$ for every s.


7. $E(x) = \sum_i E(x \mid B_i) P(B_i)$. Hint: Use $x = 1x$.


† Technical note: In the event that P is countably additive, $P(x(s) > y(s)) > 0$
implies the existence of a suitable $\epsilon$, so then $\epsilon$ need not be mentioned at all.

Suppose y is a (not necessarily real) random variable that takes on
only a finite number of values. It will be understood that E(x | y) is 
the expected value of x given that y(s) = y, provided y is such that 
this event has positive probability. Furthermore, it will be understood 
that E(x | y) is a bounded real random variable that for each s takes 
the value E(x | y(s)). The definition leaves E(x | y) undefined on the 
null set of those points s where y(s) is a value that y takes on with prob- 
ability zero. It is immaterial how this blemish is removed; in particu- 
lar E(x | y) may as well be set equal to 0, where it has not already been 
defined. 


Still more exercises 

8. $E(E(x \mid y)) = E(x)$.

9. If f is a real-valued function defined on the values of y; then f(y)
is a bounded real variable, and


(7)    $E(f(y)x) = E(f(y)E(x \mid y))$.

10. If h(y) is such that, for all f,

(8)    $E(f(y)x) = E(f(y)h(y))$,


then $h(y(s)) = E(x \mid y(s))$, except possibly on a set of s's of probability
zero.


Exercise 9 and its corollary, 8, present the most frequently used prop- 
erties of conditional expectation. Exercise 10 shows that the property 
presented in 9 characterizes conditional expectation. Through this 
characterization Kolmogoroff [K7] extends the ideas of conditional ex- 
pectation and also of conditional probability (for countably additive 
measures) to random variables y not necessarily confined to a finite or 
even denumerable set of values; though the definition in terms of ordi- 
nary conditional probability then breaks down completely, the proba- 
bility that y(s) = y often being 0 for every y. 
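
The properties in Exercises 8 and 9 can be verified directly when y takes finitely many values. The sketch below uses a hypothetical four-state world in which every state has positive probability, so that $E(x \mid y)$ is defined everywhere; the states, values, and function names are assumptions for the example.

```python
from fractions import Fraction

# Hypothetical four-state world; every state has positive probability.
P = {"a": Fraction(1, 4), "b": Fraction(1, 4),
     "c": Fraction(1, 3), "d": Fraction(1, 6)}
x = {"a": 1, "b": 5, "c": 2, "d": 8}          # bounded real random variable
y = {"a": "H", "b": "H", "c": "T", "d": "T"}  # finitely valued, not real


def expect(z):
    """E(z) for a random variable z given as a dict: state -> value."""
    return sum((Fraction(z[s]) * P[s] for s in P), Fraction(0))


def cond_expect(z, y):
    """E(z | y) as a random variable: state s -> E(z | y = y(s))."""
    out = {}
    for s in P:
        event = [t for t in P if y[t] == y[s]]  # the event y = y(s)
        mass = sum((P[t] for t in event), Fraction(0))
        weighted = sum((Fraction(z[t]) * P[t] for t in event), Fraction(0))
        out[s] = weighted / mass
    return out


f = {"H": 3, "T": -2}  # an arbitrary real function of the values of y
e_x_given_y = cond_expect(x, y)

# Both sides of (7): E(f(y) x) and E(f(y) E(x | y)).
lhs = expect({s: f[y[s]] * x[s] for s in P})
rhs = expect({s: f[y[s]] * e_x_given_y[s] for s in P})

if __name__ == "__main__":
    print("E(x) =", expect(x), " E(E(x | y)) =", expect(e_x_given_y))
    print("E(f(y)x) =", lhs, " E(f(y)E(x | y)) =", rhs)
```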


APPENDIX 2 


Convex Functions 


This appendix gives a brief account of convex functions in the same 
spirit as the preceding one gives an account of expected value. Reason- 
able facsimiles of the proofs omitted here are scattered through [H4], 
where they may be found by anyone not content to skip them. 

An interval is a set I of real numbers such that, if $x, z \in I$ and
$x < y < z$, then $y \in I$. It is not difficult to see that intervals can be classified
according to Table 1, where it is to be understood that $x < z$.


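TABLE 1. THE VARIOUS TYPES OF INTERVALS


Symbolic              The set of real y's
designation           such that              Verbal description

$(-\infty, +\infty)$  $y = y$                The infinite interval (the set of all real numbers)
$(x, +\infty)$        $x < y$                Open half-infinite interval
$(-\infty, z)$        $z > y$                Open half-infinite interval
$[x, +\infty)$        $x \le y$              Closed half-infinite interval
$(-\infty, z]$        $z \ge y$              Closed half-infinite interval
$(x, z)$              $x < y < z$            Open bounded interval
$[x, z)$              $x \le y < z$          Half-open bounded interval
$(x, z]$              $x < y \le z$          Half-open bounded interval
$[x, z]$              $x \le y \le z$        Closed bounded interval
$[x, x]$              $x = y$                One-point interval
                      $y < y$                The vacuous interval (the vacuous set)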


A real-valued function t defined for x in an interval I is convex, if
and only if the graph of the function never rises above any chord of itself.
Analytically, if $\rho$ and $\sigma$ are positive, $\rho + \sigma = 1$, and $x, y \in I$; then


(1)    $t(\rho x + \sigma y) \le \rho t(x) + \sigma t(y)$.

If equality holds in (1) for some $\rho$; then, as is easily verified, it holds
for every $\rho$, and t is linear, i.e., of the form $\alpha x + \beta$, in the closed
interval [x, y]. An interval in which t is linear will here be called an
interval of linearity. If and only if there are no intervals of linearity
other than the one-point and vacuous intervals, t is strictly convex.


Exercises 

1. Verify, at least graphically, that the following functions are convex
in the indicated intervals; discuss their intervals of linearity; and
say which are strictly convex.


$I = (-\infty, +\infty)$:

(a) $e^{\rho x}$ for every $\rho$, (b) $x^2 + \rho x + \sigma$ for every $\rho$ and $\sigma$,
(c) $|x|$, (d) $|x|^{\rho}$ for $\rho > 1$,

(e) x.

$I = (0, \infty)$:

(f) $-\log x$, (g) $x^{\rho}$ for $-\infty < \rho < 0$.

$I = (-1, +1)$:

(h) d-2)"4 (i) $1 - \cos(\pi x/2)$.


2. In an interval where t is convex, if $d^2t(x)/dx^2$ exists at x, then
$d^2t(x)/dx^2 \ge 0$; and if, for every x in an interval I, $d^2t(x)/dx^2$ exists and
is non-negative, then t is convex in I.

3. Re-explore Exercise 1 in the light of 2. 


4. Let T be a non-vacuous set of functions, t, t′, ···, convex in I,
and let

(2)    $t^*(s) = \sup_{t \in T} t(s)$.


In (2), as always in mathematics, the sup, or supremum, of a set of
numbers is the least number, possibly $\infty$, that is not less than any element
of the set. If $t^*(s) < \infty$ for every $s \in I$, then $t^*$ is convex in I.
Explore the proposition just stated, first graphically, especially for a 
finite set of linear t’s, and then analytically. What if the elements of 
T are all strictly convex? 

5. In an open interval where t is convex, it is also continuous. What 
are the facts for closed and half-closed intervals? 
6. If t is convex in I, $x_k \in I$, $\rho_k > 0$, and $\sum_k \rho_k = 1$, where k = 1, ···,
r; then


(3)    $\sum_k \rho_k t(x_k) \ge t\left(\sum_k \rho_k x_k\right)$.


Equality obtains, if and only if all the $x_k$'s are in a single interval of
linearity of t.


(a) Interpret the propositions above in terms of probability. 
(b) Prove them by arithmetic induction on r. 
(c) What if t is strictly convex? 


Exercise 6 suggests, and indeed proves a special case of, the following
well-known and most useful theorem, which cannot be proved here in
full generality.


THEOREM 1 If t is convex and bounded in the interval I, and $x(s) \in I$
for all $s \in S$, then


(4)    $E(t(x)) \ge t(E(x))$.


Equality obtains, if and only if the values of x are with probability one
contained in a single interval of linearity of t. Here and throughout this
appendix, such conditions for equality are to be understood to apply
only in the event that either P is countably additive or the random
variable is with probability one confined to a finite set of values; the
general situation for finitely additive measures is a little more complicated.
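
Theorem 1 is easy to explore numerically for a random variable with finitely many values. The sketch below uses a hypothetical three-point distribution and evaluates the gap E(t(x)) − t(E(x)) for a few convex functions, including cases where all the values lie in a single interval of linearity, so that equality is expected. The distribution and the function names are illustrative assumptions.

```python
import math


def expect(values, probs):
    """Expected value of a finitely valued random variable."""
    return sum(v * p for v, p in zip(values, probs))


def jensen_gap(t, values, probs):
    """E(t(x)) - t(E(x)), which inequality (4) asserts is non-negative."""
    return expect([t(v) for v in values], probs) - t(expect(values, probs))


# Hypothetical distribution: three values with their probabilities.
values = [-1.0, 0.5, 2.0]
probs = [0.2, 0.5, 0.3]

if __name__ == "__main__":
    # Strictly convex functions: the gap is strictly positive here.
    print("exp:      ", jensen_gap(math.exp, values, probs))
    print("square:   ", jensen_gap(lambda v: v * v, values, probs))
    # |x| is linear on each side of 0; the values straddle 0, so gap > 0.
    print("abs:      ", jensen_gap(abs, values, probs))
    # All values in [0, inf), one interval of linearity of |x|: gap is 0.
    print("abs, x>0: ", jensen_gap(abs, [0.5, 1.0, 2.0], probs))
    # A linear function gives equality for any distribution.
    print("linear:   ", jensen_gap(lambda v: 3 * v - 1, values, probs))
```

With t(x) = x² the gap is the variance of x, which anticipates Exercise 7.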


More exercises 


7. The variance of x, often written V(x), is defined thus:


(5)    $V(x) = E([x - E(x)]^2)$.

Show that

(6)    $V(x) = E(x^2) - E^2(x) \ge 0$,


with equality if and only if $P(x(s) = E(x)) = 1$.


8. Show that, if x is never smaller than some positive number,


(7)    $-\log E(x^{-1}) \le E(\log x) \le \log E(x)$.

When can either equality obtain? Write the analogue of (7) suggested
by (3), and show thereby that (7) is a generalization of the familiar
fact that the arithmetic mean (of positive numbers) is at least as great
as the geometric mean and the geometric mean is at least as great as
the harmonic mean.
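
The familiar fact mentioned in Exercise 8 can be checked on any list of positive numbers. The sketch below, with hypothetical data and a hypothetical helper name, computes the three means for equally weighted values; taking logarithms of the results gives the corresponding instance of (7).

```python
import math


def means(xs):
    """Arithmetic, geometric, and harmonic means of positive numbers."""
    n = len(xs)
    arithmetic = sum(xs) / n
    geometric = math.exp(sum(math.log(v) for v in xs) / n)
    harmonic = n / sum(1.0 / v for v in xs)
    return arithmetic, geometric, harmonic


if __name__ == "__main__":
    a, g, h = means([1.0, 2.0, 4.0])
    print("arithmetic:", a, " geometric:", g, " harmonic:", h)
    # With all values equal, the three means coincide (the equality case).
    print(means([3.0, 3.0, 3.0]))
```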


One of the most famous of all inequalities is the Schwarz inequality,
which can, though not quite obviously, be derived from Theorem 1,
and which can be stated in terms of expected values thus:


(8)    $E^2(xy) \le E(x^2)E(y^2)$,


with equality obtaining if and only if for some numbers $\rho$ and $\sigma$ not
both zero


(9)    $P(\rho x(s) = \sigma y(s)) = 1$.


Note that (9) expresses (perhaps too compactly) that, except on some
set of probability zero, either x or y vanishes identically or else each is
a fixed multiple of the other.

Statistically speaking, the Schwarz inequality expresses, in effect,
the familiar fact that any correlation coefficient must lie between +1
and −1, one of the extremes occurring if and only if at least one of the
two random variables involved is a linear function of the other.
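
A short numerical sketch of (8) and of its statistical reading follows. It assumes a hypothetical three-point joint distribution (each index carries one pair of values and one probability) and hypothetical function names; the correlation coefficient is computed from centered variables in the usual way.

```python
import math


def expect(values, probs):
    """Expected value of a finitely valued random variable."""
    return sum(v * p for v, p in zip(values, probs))


def schwarz_sides(xs, ys, probs):
    """Return the two sides of (8): E^2(xy) and E(x^2) E(y^2)."""
    exy = expect([a * b for a, b in zip(xs, ys)], probs)
    ex2 = expect([a * a for a in xs], probs)
    ey2 = expect([b * b for b in ys], probs)
    return exy ** 2, ex2 * ey2


def correlation(xs, ys, probs):
    """Correlation coefficient of x and y under the given probabilities."""
    mx, my = expect(xs, probs), expect(ys, probs)
    dx = [a - mx for a in xs]
    dy = [b - my for b in ys]
    cov = expect([a * b for a, b in zip(dx, dy)], probs)
    var_x = expect([a * a for a in dx], probs)
    var_y = expect([b * b for b in dy], probs)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical joint distribution on three points.
probs = [0.25, 0.25, 0.5]
xs = [1.0, 2.0, 4.0]
ys = [3.0, -1.0, 2.0]

if __name__ == "__main__":
    print("sides of (8):", schwarz_sides(xs, ys, probs))
    # y a fixed multiple of x, as in (9): the two sides coincide.
    print("proportional:", schwarz_sides(xs, [-2.0 * a for a in xs], probs))
    # y a decreasing linear function of x: the correlation is -1.
    print("correlation: ", correlation(xs, [5.0 - 2.0 * a for a in xs], probs))
```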

The concept of convex functions and its implications can easily be 
extended to real-valued functions defined on vectors in an n-dimensional 
vector space, the role of intervals there being replaced by convex sub- 
sets of the vector space; but an understanding of this extension, though 
desirable, is not absolutely essential in reading this book. 

One good introduction to convex subsets of vector spaces is Sections 
16.1-2 of [V4], and another especially adapted to statistical applications
is incorporated in [B18]. The standard treatise on the topic is
that of Bonnesen and Fenchel [B20].


Bibliographic Material


The bibliography of about 170 items that terminates this appendix 
lists not only all works referred to in this book but also some others, 
for it is intended to serve not only as a mechanical aid to reference but 
also as a briefly and informally annotated list of suggested readings in 
the foundations of statistics. In addition to the notes incorporated
into the bibliography, information about many of the works listed there 
is given in other parts of the book, where it can be found by referring 
to the author’s name in the author index. 

Some readers may be interested in referring to larger or more specialized
bibliographies than the one given here. The next few paragraphs
are for their guidance.

Todhunter has abundant references scattered in chronological order
through [T3], emphasizing the mathematical aspects of probability up
through the period of Laplace. Keynes, in [K4], gives a formal bibliography
which purposely does not overlap Todhu
extensively, the emphasis being on more philosophi
ability and on the period betw
Carnap in [C1] also gives a formal bibliography, which emphasizes publications
since Keynes. Carnap promises an even fuller bibliography in the
projected second volume of his work, and he recommends the bibliography
of Georg Henrik von Wright in [V5].

Bibliographies of statistic:

of [K2]. Carnap at the beginning of 
some other statistical bibliographies. The enormous work of O. K. Bu- 
ros in statistical bibliography, [B23], [B24], and [B25], should also be 
mentioned. His volumes bring togeth, i 

of statistical books. Buros also directed a bibli 
entitled “Statistical Methodology,” in thi 
tistical Association from September 1945 
rent articles, books, theses, and chapte: 
Volume 20 (1949) of the Annals of Mathematical Statistics, an important
journal of statistical theory, there are two cumulative indexes of Vol- 
umes 1-20, one arranged by author, the other by subject. 
Error, mean square, 224
see also Root-mean square error and 
Squared error 
of an estimate, definition of, 227 
Errors of first and second kind, 140, 
247 
Estimation, interval, 259 
point, 220ff
definition of, 221 
Estimation decision problem, 229ff 
Event, complement of, 11
definition of, 10 
examples of, 10 
generic symbols for, 11 
null (or virtually impossible), 24 
universal, 10 
vacuous, 10 
Events, almost equivalent, 37 
containing, 11 
equal, 11 
intersection of, 11 
union of, 11 
Expectation, conditional, 264 
Expected value, 263ff 
definition of, 263 
Experience, 44, 46, 55, 62 
Experiment and observation, 117, 118 
Extension, of an observation, 112 


of a set of acts, 113 
Extreme B, 129 
Factorability criterion for sufficiency,
130ff 

Fair coin, 33 

Fiducial probability, 262 

Fine, 37, 40 

Foundations of sciences, role of, 1 

Foundations of statistics, deep, 5 
history of, 1ff 
shallow, 5 


Gamble, 70, 71 

Gambling, 63, 64, 91, 94 

Gambling apparatus, 66 

Game, abstract, 184ff 
bilinear, 186ff 
standard, 178ff 
two-person, 178ff 

Games, in relation to minimax theories 

of decision, 180ff 

mathematics of, 184ff 
theory of, 156, 178ff 

Given, 22, 44 

Grand world, 84 

Greek fonts, 11 

Group, mathematical, 193 

Group action, 105 

Group decision problem, 172ff 
and observation, 210 

Group minimax rule, 207 


Hausdorff moment problem, 53, 55, 152 
Homogeneous coordinates, 136 
Hyper-utility, 75 
Hypothesis, alternative, 247 

extreme null, 254 

null, 247 


Income, 163 

negative, 164, 169, 170 

and loss, 182, 200 

personal, 173 
Inconsistency, 20, 21, 57 
Indecision, 21 
Independence in qualitative probability, 

44, 91 

Independent events, 44 
Independent random variables, 46 
Indifference, 17, 59 

difficulty of testing, 17 
Inductive behavior, 159 


Inductive inference, 2 
Inexact science, 59 
Infimum, 80 
Infinite sets in applied mathematics, 39, 
77 
Infinite utility, 81 
Information, 50, 153, 235ff 
differential, 236ff 
Information inequality, 238 
Insufficient reason, principle of, 64, 65, 
193 
Integral, 263 
Interrogation, behavioral, 28 
intermediate mode of, 28 
strictly empirical, 28, 29 
Intersection of events, 11 
Interval, 266 
Interval estimation, 257 
definition of, 259, 260 
Interval of gambles, 75 
Interval of linearity, 267 
Invariance of a game, 194ff 
Invariant minimax, 197, 198 
Irrelevant, 126 
utterly, 126 
Irrelevant event, 44 


Journal of American Statistical Associa- 
tion, 270 
Judgment, 156 


Large numbers, strong law of, 54 
weak law of, 49, 54, 91 
Learning, 44, 55 
see also Experience 
Lebesgue measure, 41 
Likelihood ratio, 48, 135ff, 225 
Likelihood-ratio test, 139, 213 
Linear function, 267 
Logic, 3 
decision and, 6 
empirical interpretation of, 20 
criticism of, 20 
incompleteness of, 59 
normative interpretation of, 20 
Logical behavior, implications of, 7, 8, 20 
“Look before you leap principle,” 16 
criticism of, 16, 17 
Loss, 163, 164, 169, 170 
personal, 174 
uniformity of, 166, 174
Loss and negative income, 182, 200 


Marginal utility, 103, 104 
diminishing, 94ff 
Mathematical expectation, principle of, 
91, 92 
Maximin, 184 
Maximum-likelihood estimate, 140, 203, 
222ff, 241 
definition of, 225 
Mean-square error, 224 
see also Root mean-square error and 
Squared error 
Measurable random variable, 45 
Median, 228 
Microcosm, 86 
Minimax, 184 
Minimax act, 164 
Minimax equality, 179, 187 
Minimax estimate, 232, 240, 241 
Minimax rule, 157, 180ff 
and simple ordering, 205 
group, 174ff, 207 
objectivistic, 164 ff 
definition of, 164 
illustrations of, 164ff 
objectivistic motivation of, 168, 169 
Minimax rules, criticism of, 200ff 
Minimax test, 249, 250 
Minimax theories, mathematics of, 184ff 
Minimax theory, 156 
objectivistic, definition of, 165 
objectivistic approach to, 158ff
Minimax theory and observation, 208 
Minimax value, 164 


Mixed act, 162, 163 
in group decision problem, 173 
Mixed acts in statistics, 213, 216, 217ff
Mixture of gambles, 71 
Moment problem, Hausdorff, 53, 55, 152

Moral expectation, 93, 94 

Moral worth, 93ff 

Multipersonal considerations, 129, 124,
126, 127, 148, 154ff, 172ff
see also Agreement, Certainty, and 
Disagreement 
Multiple observation (or statistic), 111 
counting of, 133 


Necessary statistic, 137, 224 
Necessary views of probability, 3, 60, 61, 
67 
Negative income, 164, 169, 170 
and loss, 182, 200 
Neyman-Pearson school, 140 
Neyman-Pearson theory of testing, 252 
Non-Archimedean probability, 39
Normal distribution, 132, 222 
Normative interpretation, of postulates, 
19ff 
of theory of utility, 97 
Normative theory, 102 
Nuisance parameter, 223 
Null event, 24, 26 
Null hypothesis, 247 
extreme, 254 
Null observation, 112 


Objectivistic decision problem, 159 
Objectivistic observational problem, 208 
Objectivistic views of probability, 3, 60, 
61, 67, 253, 254 
central difficulty of, 4 
probability of isolated propositions 
under, 4 
Observation, 105ff, 125 ff 
cost of, 116, 118, 169, 214, 215 
decision after, 23 
definition of, 110 
Observational problem, objectivistic, 208 
Observation and experiment, 117, 118 
Observed value, 110 
Obtains, 10 
Operating characteristic, 248 
Optimism, 68 
Order statistic, 132


Parameter, 221 
nuisance, 223 
Partial ordering, 21 
Partition, 24 
almost uniform, 34 
Partition formula, 45 
Partition problems, 120ff 
Personalistic view, 56 
difficulties with, 57 
possible incompleteness of, 59
Personalistic views of probability, 3, 67 
Personal probability, 27, 30
criticism of verbalistic approach to, 27, 28
other terms for, 30 
Person as economic unit, 8 
Pessimism, 68 
Plan as a single decision, 16 
criticism of, 16, 17 
Point estimation, 220ff 
definition of, 221 
Poisson distribution, 222 
Power function, 248 
Preference, 17 
as simple ordering, 18 
as partial ordering, 21 
conditional, 22 
superfluous for consequences, 25, 26 
irreflexivity of, 17 
transitivity of, 18 
Preference among consequences, 25 
distinguished from preference among 
acts, 25 
Pre-statistics, 5 
Primary act, 163 
Prize, 31 
Probabilities of higher order, 58 
Probability, mathematical properties of, 
2,3 
unknown, superfluousness of in person- 
alistic theory, 50, 51 
views on, dualistic, 2, 51, 62, 63 
necessary, 3, 60, 61, 67 
objectivistic, 3, 60, 61, 67, 253, 254 
personalistic, 3, 67 
see also Personalistic view 
Probability measure, 33 
Probability space, 45 
Propositions, probability of, under ob- 
jectivistic views, 4, 27, 61, 62 
Pseudo-microcosm, 86 
Psychological probability, 30 


Qualitative probability, definition of, 32 
example, 28 
fine but not tight, 41 
neither fine nor tight, 41 
tight but not fine, 41 
Quantitative probability, 33 


Randomization, 66, 163, 216, 217 
Random numbers, 67 


Random variable, 45 
real, 263 
Rational behavior, 7 
Ray, 135 
Regret, 163 
Rejecting, 247 
Root-mean-square error, 257 
see also Mean-square error and Squared 
error 


St. Petersburg paradox, 93ff 
Schwarz inequality, 269
Science, almost exact, 101 
Sequential analysis, 116, 142ff, 215, 216 
Sequential observational program, 142 
Sequential probability ratio procedure, 
146 
Significance level, 252 
reporting of, 256 
Significance tests, 246ff 
Simple dichotomy, 138, 145, 146, 148, 
212, 213, 252 
Simple ordering, 18 
and the minimax rule, 205 
exercises on, 19 
Size of a test, 250 
Small world, 9, 16, 82ff 
Squared error, 81, 234 
see also Mean-square error and Root 
mean-square error 
Standard deviation, 257 
Standard game, 178ff 
Standard sequence of observations, 227 
State, 9 
true, 9 
States, generic symbols for, 11 
Statistic, 128 
Statistics, other names for, 2 
scope of, 2 
Statistics proper, 5, 105, 114, 121 
definition of, 154 
Strategy function, 111 
Strictly convex function, 267 
Subjective probability, 30 
Sufficient statistic, 129ff, 212, 224, 230, 
237, 246, 256, 259 
factorability criterion for, 130ff 
Supremum, 80, 267 
Sure personal probabilities, 57, 58, 66 
Sure-thing principle, 21ff, 114, 207 
Symmetric dual, 78
Symmetric sequence of events, 50ff 
Symmetry, 232, 246 

in probability, 63ff 

of games, 193ff 


Tastes, 155 

Team mate, 132 

Test, definition of, 247 

of hypotheses, 246ff 

Testing, 221 

Testing problem, 247 

Ties in rank, 219 

Tight, 37, 40 

Time in theory of decision, 10, 17, 23, 
44 

Tolerance interval, 262 

Tolerance level, 262 

Topological assumptions possible for a 
simple ordering, 18 

Transitivity, 19 

True state, 9 


Unbiased estimate, 203, 224, 244, 245 
definition of, 226 

Unbiased test, 249 
criticism of, 250 

Uniform distribution, 131 

Union of events, 11 

Universal event, 10 
symbol for, 11 

Utile, 82 


Utility, 69 
and the minimax rules, 201ff 
bounded, 95 
criticism of, 91ff 
definition of, 73 
history of, 91ff 
logarithmic, 94, 95 
probability-less, 91, 95, 96 
Utterly irrelevant observation, 126, 212, 
237 


Vacillation, 21 
Vacuous event, 10 
symbol for, 10, 11 
Vagueness, 59, 168, 169 
Value of observation, 151 
Variance, 268 
Venn diagram, 12 
Verbalistic and behavioralistic outlooks, 
17 
Verbalistic outlook, 159ff, 220, 260, 261 
inadequacy of in definition of personal 
probability, 27, 28 
Virtual extension, 148 
Virtually equivalent acts, 148 
Virtually impossible event, 24 


World, choice of, 9 
definition of, 9 
examples of, 8 
grand, 84 
small, 9, 16, 82ff 